This is an archive of the discontinued LLVM Phabricator instance.

Differential D47675

[test-suite][RFC] Using Google Benchmark Library on Harris Kernel
Closed, Public

Authored by proton on Jun 2 2018, 5:06 AM.

Details

Reviewers

dberris
Meinersbur
homerdin
MatzeB
hfinkel
cmatthews
kristof.beyls

Commits

rOLDT335611: [test-suite] Using Google Benchmark Library on Harris Kernel
rL335611: [test-suite] Using Google Benchmark Library on Harris Kernel

Summary

Hi,
I have used the Google Benchmark library on the Harris corner detection kernel (from the Polymage benchmarks). I would like to know whether the way I have used the library here is correct, or whether there is a better way to do this.

Event Timeline

proton created this revision. Jun 2 2018, 5:06 AM

Herald added subscribers: llvm-commits, mgorny. Jun 2 2018, 5:06 AM

Did you consider putting the driver code (initialization, malloc, free, checking the result) into a different file than the kernel code? This would allow measuring the kernel (code size, LLVM statistics, compile time, etc.) independently of the boilerplate code.

MicroBenchmarks/harris/harris.cpp
177–197 ↗	(On Diff #149608)	Maybe the malloc/free calls should be taken out of the measured kernel.
201–208 ↗	(On Diff #149608)	Could you rewrite this to use multidimensional access subscripts? E.g. `img[_i0-1][_i1-1]`.
MicroBenchmarks/harris/sha1.hpp
1–45 ↗	(On Diff #149608)	There is already hashing used by test-suite (see `HashProgramOutput.sh`), why add another one?
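
The request for multidimensional subscripts can be illustrated with a small sketch. It assumes a compile-time row width; `kCols`, `sumNeighborsFlat`, and `sumNeighbors2D` are hypothetical names for illustration and do not appear in the patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical padded row width; the patch uses WIDTH + 2 from harris.h.
constexpr std::size_t kCols = 6;

// Flat indexing: the row stride is spelled out by hand at every access.
float sumNeighborsFlat(float *img, std::size_t i, std::size_t j) {
  assert(i >= 1 && j >= 1 && j + 1 < kCols);
  return img[(i - 1) * kCols + (j - 1)] + img[(i + 1) * kCols + (j + 1)];
}

// Multidimensional subscripts: the same two accesses through a pointer to
// rows of kCols floats, written as img[i - 1][j - 1] and img[i + 1][j + 1].
float sumNeighbors2D(float (*img)[kCols], std::size_t i, std::size_t j) {
  assert(i >= 1 && j >= 1 && j + 1 < kCols);
  return img[i - 1][j - 1] + img[i + 1][j + 1];
}
```

Both functions read the same two elements; the second form leaves the stride arithmetic to the compiler and keeps each subscript visible on its own.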

proton added reviewers: MatzeB, hfinkel, cmatthews, kristof.beyls. Jun 5 2018, 12:14 PM

proton updated this revision to Diff 150032. Jun 5 2018, 1:16 PM

proton updated this revision to Diff 150815. Jun 11 2018, 12:40 PM

proton edited the summary of this revision.

Looks great. Did you do a performance comparison with/without Polly?

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

[comment] Is there a reason why init.cpp and main.cpp are separate files?

MicroBenchmarks/harris/harris.h
34	[style] `DUMP_IMAGE` is named like a macro, but is a function.
MicroBenchmarks/harris/harris_kernel.cpp
118 ↗	(On Diff #150815)	[style] `return` before the end of the function seems unnecessary.

proton updated this revision to Diff 151105. Jun 13 2018, 12:11 AM

In D47675#1129087, @Meinersbur wrote:

Looks great. Did you do a performance comparison with/without Polly?

Polly + O3 and O3 alone take the same time. It seems that by the time the code reaches Polly it is already heavily optimized at O3, and Polly cannot find any further optimization even though the code is in its SCoP.

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

Updated.
I still have to keep an extra global array (other than the image) to copy the final output to. I couldn't modify the original image array as it will affect the input of other threads.

[comment] Is there a reason why init.cpp and main.cpp are separate files?

init.cpp has the image initialization and print functions, which can be used by other image processing kernels as well, so I kept it in a separate file.

In D47675#1130720, @proton wrote:

In D47675#1129087, @Meinersbur wrote:

Looks great. Did you do a performance comparison with/without Polly?

Polly + O3 and O3 alone take the same time. It seems that by the time the code reaches Polly it is already heavily optimized at O3, and Polly cannot find any further optimization even though the code is in its SCoP.

How long is a single run with O3?

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

Updated.
I still have to keep an extra global array (other than the image) to copy the final output to. I couldn't modify the original image array as it will affect the input of other threads.

It still writes to the target array in parallel without locking.

I suggest calling harrisKernel once more, only for the correctness check.

[comment] Is there a reason why init.cpp and main.cpp are separate files?

init.cpp has the image initialization and print functions, which can be used by other image processing kernels as well, so I kept it in a separate file.

If it is supposed to be a shared resource, it shouldn't be in the harris directory.

Could you check whether llvm-lit correctly collects execution time, compile/link time, LLVM -stats, code size?

Updated input size, used malloc to allocate memory for the array.

In D47675#1131201, @Meinersbur wrote:

Could you check whether llvm-lit correctly collects execution time, compile/link time, LLVM -stats, code size?

I don't know how to check LLVM -stats using lit.
The sizes match the output of llvm-size; compile time and link time are also fine.

lit Output: used "lit ." in build/Microbenchmark/harris
compile_time: 1.3595
link_time: 0.0832
exec size: 364288
exec_time: 28254000.0000
Testing Time: 1.22s
.
.
.

For now, I have merged init.cpp and main.cpp. The image initialization here is chosen specifically so that the output can be visualized nicely. We may or may not use this initialization for other image processing kernels.
If a common image initialization source file is introduced in the future (perhaps with multiple types of image initialization and some helper functions for image processing), only minute changes to main.cpp will be needed.

Here are some of the stats for this code:

RunTime

With Benchmark library

Flag	CPU Time	Iteration	System measured (using time ./a.out)
`O0`	117259447ns	6	1.039s
`O3`	32498956ns	21	1.214s
`O3+Polly`	32562202ns	21	1.205s

Without Benchmark library (input size is changed - refer harris.h)

Flag	System measured time
`O0`	1.241s
`O3`	0.697s
`O3+Polly`	0.611s

Note: Clang is built in debug mode and I have compiled benchmark using clang++ -O3 -mllvm -polly --std=c++11 harrisKernel.cpp main.cpp -lbenchmark -lpthread -o withPolly.out

lit --vg --vg-leak

It fails, but so do the other benchmarks in the MicroBenchmarks folder, so there may be a memory leak problem in the benchmark library.

I am thinking of verifying the output manually, since the output follows a pattern with this checkerboard initialization. Let me know whether this is a good idea.
I have removed the reference output for now as it is 10MB in size; I will update the diff with the reference output once the checking method is finalized.

dberris added inline comments. Jun 13 2018, 7:07 PM

MicroBenchmarks/harris/main.cpp
108–121	There's a few comments I have about this code, but let me start with the simple(r) ones:
You may want to make the image sizes configurable, and using the benchmark API to try it on differently sized images (so that you can see how the algorithm scales based on input sizes).
You probably want to ensure that you do something with the image data after the loop(s) to ensure that the allocations aren't optimised away. You can use the benchmark::DoNotOptimize(...) function to do some of that.
You might also want to consider measuring throughput (computing the amount of data processed for the time it took per iteration).
Before you get into the loop, you should probably run the kernel once on the just-allocated memory, to ensure that you're not just measuring the cost of pulling data through the cache(s). This is probably better to do with smaller images.
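
Two of these points (an untimed warm-up run, and consuming the result so the work is not optimised away) can be sketched without the benchmark library. This is only an illustration written with `std::chrono`; `fakeKernel`, `timeKernel`, and `gSink` are hypothetical names, and in the actual benchmark the sink's role would be played by `benchmark::DoNotOptimize`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Harris kernel: squares every input element.
void fakeKernel(const std::vector<float> &in, std::vector<float> &out) {
  assert(in.size() == out.size());
  for (std::size_t k = 0; k < in.size(); ++k)
    out[k] = in[k] * in[k];
}

// Volatile sink: storing a result here forces the value that is read to be
// computed rather than discarded by the optimiser.
volatile float gSink = 0.0f;

// Runs the kernel once untimed (touching the freshly allocated buffers),
// then times `iterations` runs and returns the mean seconds per run.
double timeKernel(std::size_t elements, int iterations) {
  assert(elements > 0 && iterations > 0);
  std::vector<float> in(elements, 2.0f), out(elements, 0.0f);

  fakeKernel(in, out); // warm-up run, not measured

  const auto start = std::chrono::steady_clock::now();
  for (int it = 0; it < iterations; ++it) {
    fakeKernel(in, out);
    gSink = out[elements - 1]; // consume the result
  }
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count() / iterations;
}
```

Allocation and the warm-up run sit outside the timed region, so the measured time covers only repeated kernel runs on memory that has already been touched.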

Using Polly's -debug-only=polly-scops output showed that the kernel is in fact not optimized:

Invalidate SCoP because of reason 0

NOTE: Run time checks for %for.cond11.preheader---%for.cond.cleanup532 could not be created as the number of parameters involved is too high. The SCoP will be dismissed.
Use:
        --polly-rtc-max-parameters=X
to adjust the maximal number of parameters but be advised that the compile time might increase exponentially.

Bailing-out because could not build alias checks

Using -polly-rtc-max-parameters=999 does not help. I remember that Polly prints that message whenever it cannot build the alias checks, even for other reasons than stated.

Runtime on this differential

Benchmark	Time	CPU	Iterations
Polly	28825	28757	25
Polly	28964	28910	24
-O3	13216	13191	54
-O3	13072	13049	54
-O0	98790	98716	7
-O0	99720	99652	7

Modified the kernel so that Polly is now able to make some changes to it (no run-time check problem).

Did you check whether Polly recognizes the call to exp as part of the SCoP?

dberris added inline comments. Jun 19 2018, 10:29 PM

MicroBenchmarks/harris/main.cpp
179	It seems that HEIGHT and WIDTH are input values anyway; consider using multiple input sizes to see how the kernel performs as the image size goes up. You might also not need the `__restrict__` attributes for the malloc-provided heap memory either. This means you could do:

float *image = reinterpret_cast<float *>(malloc(sizeof(float) * (2 + state.range(0)) * (2 + state.range(1))));

When you register the benchmark, you can then provide the image sizes to test with:

BENCHMARK(HarrisBenchmark)
    ->Unit(benchmark::kMicrosecond)
    ->Args({256, 256})
    ->Args({512, 512})
    ->Args({1024, 1024})
    ->Args({2048, 2048});

You can see more options at https://github.com/google/benchmark#passing-arguments. Another thing you may consider measuring, as I suggested in the past, is throughput. To do that, you can call `state.SetBytesProcessed(...)` in the benchmark body, typically at the end just before exiting -- you want to essentially report something like:

state.SetBytesProcessed(sizeof(float) * (state.range(0) + 2) * (state.range(1) + 2) * state.iterations());

This will add a "MB/sec" output alongside the time it took for each iteration of the benchmark.
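
The arithmetic behind the suggested `SetBytesProcessed` argument can be worked through in a standalone sketch. The helper names `bytesProcessed` and `megabytesPerSecond` are hypothetical, and the conversion assumes 1 MB = 1024 * 1024 bytes, which is an assumption of this sketch rather than something stated in the review.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one padded float image of (height + 2) x (width + 2) elements,
// multiplied by the number of iterations. This mirrors the expression
// suggested for SetBytesProcessed in the comment above.
std::int64_t bytesProcessed(std::int64_t height, std::int64_t width,
                            std::int64_t iterations) {
  assert(height >= 0 && width >= 0 && iterations >= 0);
  return static_cast<std::int64_t>(sizeof(float)) * (height + 2) *
         (width + 2) * iterations;
}

// Throughput in MB/s, assuming 1 MB = 1024 * 1024 bytes (sketch assumption).
double megabytesPerSecond(std::int64_t bytes, double seconds) {
  assert(seconds > 0.0);
  return static_cast<double>(bytes) / seconds / (1024.0 * 1024.0);
}
```

For a 256 x 256 image and a 4-byte `float`, one iteration counts 4 * 258 * 258 = 266,256 bytes. At roughly 768 us per iteration that is on the order of 330 MB/s, which is in line with the 256/256 row reported later in this thread.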

proton updated this revision to Diff 152211. Jun 20 2018, 6:35 PM

proton marked an inline comment as done.

Thanks for making some of the changes. I'm still not clear on a couple of things.

Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?

MicroBenchmarks/harris/main.cpp

112–120

Why are these still using HEIGHT and WIDTH? Why aren't these just:

const size_t height = state.range(0);
const size_t width = state.range(1);

float **image = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + height) * (2 + width)));
float **imageOutput = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + height) * (2 + width)));
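
One possible way to size the buffers from run-time values is sketched below. It swaps in `std::vector<float>` with a flat index helper instead of `malloc` and `float **`; `PaddedImage` is a hypothetical name, and in the benchmark the two constructor arguments would come from `state.range(0)` and `state.range(1)`. Whether Polly handles this form is not something the sketch addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a (height + 2) x (width + 2) float image stored in
// one contiguous buffer whose size is chosen at run time.
struct PaddedImage {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> data;

  PaddedImage(std::size_t height, std::size_t width)
      : rows(height + 2), cols(width + 2), data(rows * cols, 0.0f) {}

  // Element access by (row, column); the row stride is cols.
  float &at(std::size_t i, std::size_t j) {
    assert(i < rows && j < cols);
    return data[i * cols + j];
  }
};
```

The vector releases its memory automatically, so the matching `free` calls at the end of the benchmark body would not be needed for buffers allocated this way.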

186

Can you re-format this? Preferably with clang-format if possible, so that it's easier to read.

Formatted using clang-format.

Results

With Polly:

BENCHMARK	Time	CPU	Iteration	Throughput
BENCHMARK_HARRIS/256/256	769 us	768 us	911	330.508MB/s
BENCHMARK_HARRIS/512/512	4001 us	3996 us	177	252.207MB/s
BENCHMARK_HARRIS/1024/1024	25690 us	25650 us	28	156.553MB/s
BENCHMARK_HARRIS/2048/2048	118023 us	117830 us	6	136.054MB/s

Without Polly:

BENCHMARK	Time	CPU	Iteration	Throughput
BENCHMARK_HARRIS/256/256	626 us	625 us	1135	406.15MB/s
BENCHMARK_HARRIS/512/512	3074 us	3068 us	229	328.51MB/s
BENCHMARK_HARRIS/1024/1024	17121 us	17086 us	42	235.022MB/s
BENCHMARK_HARRIS/2048/2048	64207 us	64077 us	11	250.189MB/s

In D47675#1138707, @dberris wrote:

Thanks for making some of the changes. I'm still not clear on a couple of things.

Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?

I replied to your previous comment but for some reason, it is showing only after your inlined comment not here.

I cannot use a pointer to pointer (float **) here, as the compiler may assume that some of the pointers overlap, which prevents Polly from detecting SCoPs here.

I have to allocate the fixed size arrays here as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++

Also, is the number of bytes processed calculated with respect to the size of the output or the total number of bytes accessed in the kernel?

MicroBenchmarks/harris/main.cpp
179	Cannot use float *image as pointers may overlap, and this prevents Polly from detecting SCoPs. I have to allocate the fixed-size arrays here, as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...));" is not allowed by clang++. I did consider adding SetBytesProcessed, but I was not sure how many bytes should be passed as the argument (output image size or the total bytes accessed in the kernel), so I commented out the line "SetBytesProcessed(static_cast<int64_t>(state.iterations())*WIDTH*HEIGHT*50);" but forgot to ask about it.

In D47675#1139805, @proton wrote:

In D47675#1138707, @dberris wrote:

Thanks for making some of the changes. I'm still not clear on a couple of things.

Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?

I replied to your previous comment but for some reason, it is showing only after your inlined comment not here.

I cannot use a pointer to pointer (float **) here, as the compiler may assume that some of the pointers overlap, which prevents Polly from detecting SCoPs here.

I'm not sure that's entirely true -- the pointer is coming from malloc, so they're meant to not overlap. At least LLVM should be able to detect that.

I have to allocate the fixed size arrays here as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++

It's not allowed because height and width are not constant expressions. Note that even in function arguments, the "array" form will be treated as pointers anyway so it shouldn't make any difference if you change the API to:

void harrisKernel(
    int height, int width, float **inputImg,
    float **outputImg, float **Ix,
    float **Iy, float **Ixx,
    float **Ixy, float **Iyy,
    float **Sxx, float **Sxy,
    float **Syy, float **det,
    float **trace)

If anything, you want to mark those pointers that the compiler is supposed to treat as non-aliasing as restrict or __restrict__.
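
The parameter adjustment mentioned here can be checked at compile time. In the sketch below, a parameter declared as a 2D array is adjusted to a pointer to its rows, which is a distinct type from `float **`; switching the API to `float **` would therefore also change how the caller has to lay out the data. The names are hypothetical, and `__restrict__` is a GCC/Clang extension rather than standard C++.

```cpp
#include <cassert>
#include <type_traits>

// A parameter declared as a 2D array is adjusted to a pointer to its rows...
using ArrayForm = void(float[3][4]);
using RowPointerForm = void(float (*)[4]);
// ...which is not the same type as a pointer to pointer.
using PointerPointerForm = void(float **);

constexpr bool kArrayIsRowPointer =
    std::is_same<ArrayForm, RowPointerForm>::value;
constexpr bool kArrayIsPointerPointer =
    std::is_same<ArrayForm, PointerPointerForm>::value;

// Hypothetical kernel step whose two buffers are promised not to alias.
// __restrict__ is a compiler extension (GCC/Clang), not standard C++.
void squareRows(int rows, float (*__restrict__ in)[4],
                float (*__restrict__ out)[4]) {
  assert(rows >= 0);
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < 4; ++j)
      out[i][j] = in[i][j] * in[i][j];
}
```

Here `kArrayIsRowPointer` is true and `kArrayIsPointerPointer` is false, so the row-pointer form keeps the contiguous layout while still allowing a no-alias annotation on each argument.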

Also, is the number of bytes processed calculated with respect to the size of the output or the total number of bytes accessed in the kernel?

The bytes processed is what you say it is -- as I suggested, it is based on the input size (size of the input image). If you want to measure something else, you're going to have to provide what that throughput is.

The beauty of having a benchmark suite is that you can test out these various approaches in the same benchmark. You can explore alternative strategies like:

Turn the harris kernel implementation into a template, so you can use C++ features and take const std::array<std::array<float, W>, H>&, and let the compiler deduce the sizes.
Have a version of the kernel with pointer arguments with __restrict__ and one without.
Instead of taking output pointers, consider returning a struct/tuple with std::unique_ptr<float[][]> members.
Use C++ array new and array delete (say new float*[(height *2) * (width * 2)]).
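
The first alternative in this list can be sketched as a function template whose image dimensions are deduced from the argument type. The body below is only a stand-in (a 3x3 box sum, like the Sxx step of the kernel), not the full Harris kernel, and `boxSum3x3` is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// H and W are deduced from the std::array argument types, so the caller
// never passes the dimensions explicitly.
template <std::size_t H, std::size_t W>
void boxSum3x3(const std::array<std::array<float, W>, H> &in,
               std::array<std::array<float, W>, H> &out) {
  static_assert(H >= 3 && W >= 3, "need at least one interior pixel");
  // Running in place would read values that were already overwritten.
  assert(&in != &out);
  for (std::size_t i = 1; i + 1 < H; ++i)
    for (std::size_t j = 1; j + 1 < W; ++j) {
      float sum = 0.0f;
      for (std::size_t di = 0; di < 3; ++di)
        for (std::size_t dj = 0; dj < 3; ++dj)
          sum += in[i + di - 1][j + dj - 1];
      out[i][j] = sum; // only interior pixels are written
    }
}
```

Because `std::array` stores its elements inline, images of the sizes used in this benchmark would need to live on the heap (for example behind a `std::unique_ptr`) rather than in a local variable.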

Anyway, I'm fine with the current state of it, but would like to see more work done, if not now then in the future. This will be really helpful for people working on the compiler who are trying to see whether performance/throughput improves (or regresses) with changes to the compiler. I'm sure the folks working on Polly would like to be able to diagnose why the Polly version is slower than the normal build.

Please wait for someone else to LGTM/Accept before landing. I'm sure we can spend a lot more time trying to make these benchmarks better, but in the meantime having *something* in there is better than not having one in there.

MicroBenchmarks/harris/main.cpp
179	I don't know whether you want to optimise for Polly or make Polly just recognise these pointers shouldn't overlap. If Polly can't detect that these pointers are coming from different 'malloc' calls, then I suspect that's a bug in Polly rather than something you need to work around in the benchmark. Note that maybe the better thing to do is to change the kernel's API to put `restrict` or `__restrict__` on the pointers, so that the optimiser in those cases might be able to assume that the pointers don't alias and don't do anything special in this benchmark. See my top-level comment for alternatives to explore, if you're open to it.

This revision is now accepted and ready to land. Jun 21 2018, 4:37 PM

In D47675#1140100, @dberris wrote:

Please wait for someone else to LGTM/Accept before landing. I'm sure we can spend a lot more time trying to make these benchmarks better, but in the meantime having *something* in there is better than not having one in there.

I agree with @dberris. Once this has landed, we can look into adding more such benchmarks and basing patches on common structures.

homerdin added inline comments. Jun 25 2018, 11:43 AM

MicroBenchmarks/harris/main.cpp
88	This will write `output.txt` into whichever directory lit is run in. You can set the working directory that the test will run in by passing an argument to `llvm_test_run()` > `llvm_test_run(WORKDIR ${CMAKE_CURRENT_BINARY_DIR})`

Added LICENSE and fixed the working directory issue.

Closed by commit rL335611: [test-suite] Using Google Benchmark Library on Harris Kernel (authored by homerdin). Jun 26 2018, 8:12 AM

This revision was automatically updated to reflect the committed changes.

Meinersbur mentioned this in D101844: [MicroBenchmarks] Add initial loop vectorization benchmarks. May 11 2021, 9:02 AM

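Revision Contents

Path	Size
MicroBenchmarks/CMakeLists.txt	1 line
MicroBenchmarks/harris/CMakeLists.txt	13 lines
MicroBenchmarks/harris/harris.h	55 lines
MicroBenchmarks/harris/harris.reference_output	1 line
MicroBenchmarks/harris/harrisKernel.cpp	119 lines
MicroBenchmarks/harris/main.cpp	251 lines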

Diff 151811

MicroBenchmarks/CMakeLists.txt

Context not available.
	add_subdirectory(libs)
	add_subdirectory(XRay)
	add_subdirectory(LCALS)
	add_subdirectory(harris)
	endif()
Context not available.

MicroBenchmarks/harris/CMakeLists.txt

				list(APPEND CPPFLAGS -std=c++11 )

				set(REFERENCE_OUTPUT ${CMAKE_CURRENT_SOURCE_DIR}/harris.reference_output)
				llvm_test_verify("${CMAKE_SOURCE_DIR}/HashProgramOutput.sh ${CMAKE_CURRENT_BINARY_DIR}/output.txt")
				llvm_test_verify("${FPCMP} ${CMAKE_CURRENT_BINARY_DIR}/output.txt ${REFERENCE_OUTPUT}")

				llvm_test_run()
				llvm_test_executable(harris harrisKernel.cpp main.cpp)
				target_link_libraries(harris benchmark)

MicroBenchmarks/harris/harris.h

				#ifndef __HARRIS_H__
				#define __HARRIS_H__

				#include <cstdlib>
				#include <cstring>
				#include <fstream>
				#include <iomanip>
				#include <iostream>
				#include <string>

				// ============================================================================
				// ============================================================================

				// Image Size
				// (Any box size will work)
				// This parameter is used in input // used only in init_checkboard_image
				#define BOX_SIZE 10

				/* Comment this to not use google benchmark library */
				#define BENCHMARK_LIB


				// Smaller input is fine here because benchmark lib takes care of small runtimes
				#define HEIGHT 1000
				#define WIDTH 1000

				// ============================================================================
				// ============================================================================

				// Parameters For harris kernel
				#define THRESHOLD 0.1

				// ============================================================================
				// ============================================================================

				void initCheckboardImage(int height, int width); // Initialize a checkboard image
				void printImage(int height, int width, float img[(2 + HEIGHT)][2 + WIDTH]);
				// harris kernel from polymage_naive.cpp
				void harrisKernel(int height
				, int width
				, float inputImg[2 + HEIGHT][2 + WIDTH]
				, float outputImg[(2 + HEIGHT)][2 + WIDTH]
				, float Ix [(2 + HEIGHT)][2 + WIDTH]
				, float Iy [(2 + HEIGHT)][2 + WIDTH]
				, float Ixx [(2 + HEIGHT)][2 + WIDTH]
				, float Ixy [(2 + HEIGHT)][2 + WIDTH]
				, float Iyy [(2 + HEIGHT)][2 + WIDTH]
				, float Sxx [(2 + HEIGHT)][2 + WIDTH]
				, float Sxy [(2 + HEIGHT)][2 + WIDTH]
				, float Syy [(2 + HEIGHT)][2 + WIDTH]
				, float det [(2 + HEIGHT)][2 + WIDTH]
				, float trace [(2 + HEIGHT)][2 + WIDTH]);

				// ============================================================================
				#endif

MicroBenchmarks/harris/harris.reference_output

33f734b11d0139b6c0e68f583fc9ce3a

MicroBenchmarks/harris/harrisKernel.cpp

				#include "harris.h"


				// harris kernel from polymage_naive.cpp
				void harrisKernel(int height, int width
				, float inputImg[2 + HEIGHT][2 + WIDTH]
				, float outputImg[(2 + HEIGHT)][2 + WIDTH]
				, float Ix [(2 + HEIGHT)][2 + WIDTH]
				, float Iy [(2 + HEIGHT)][2 + WIDTH]
				, float Ixx [(2 + HEIGHT)][2 + WIDTH]
				, float Ixy [(2 + HEIGHT)][2 + WIDTH]
				, float Iyy [(2 + HEIGHT)][2 + WIDTH]
				, float Sxx [(2 + HEIGHT)][2 + WIDTH]
				, float Sxy [(2 + HEIGHT)][2 + WIDTH]
				, float Syy [(2 + HEIGHT)][2 + WIDTH]
				, float det [(2 + HEIGHT)][2 + WIDTH]
				, float trace [(2 + HEIGHT)][2 + WIDTH])
				{
				for (int _i0 = 1; (_i0 - HEIGHT - 1 < 0); _i0++) {
				for (int _i1 = 1; (_i1 - WIDTH - 1 < 0); _i1++) {
				(Iy)[_i0][_i1] =
				(((((((inputImg[_i0 - 1][_i1 - 1]) * -0.0833333333333f) +
				((inputImg[_i0 - 1][_i1 + 1]) * 0.0833333333333f)) +
				((inputImg[_i0][_i1 - 1]) * -0.166666666667f)) +
				((inputImg[_i0][_i1 + 1]) * 0.166666666667f)) +
				((inputImg[_i0 + 1][_i1 - 1]) * -0.0833333333333f)) +
				((inputImg[_i0 + 1][_i1 + 1]) * 0.0833333333333f));
				}
				}

				for (int _i0 = 1; (_i0 - HEIGHT - 1 < 0); _i0++) {
				for (int _i1 = 1; (_i1 - WIDTH - 1 < 0); _i1++) {
				(Ix)[_i0][_i1] =
				(((((
				(inputImg[-1 + _i0][-1 + _i1] * -0.0833333333333f) +
				(inputImg[ 1 + _i0][-1 + _i1] * 0.0833333333333f)) +
				(inputImg[-1 + _i0][_i1 ] * -0.166666666667f)) +
				(inputImg[ 1 + _i0][_i1 ] * 0.166666666667f)) +
				(inputImg[-1 + _i0][ 1 + _i1] * -0.0833333333333f)) +
				(inputImg[ 1 + _i0][ 1 + _i1] * 0.0833333333333f));
				}
				}

				for (int _i0 = 1; (_i0 - HEIGHT - 1 < 0); _i0++) {
				for (int _i1 = 1; (_i1 - WIDTH - 1 < 0); _i1++) {
				Iyy[_i0][_i1] = Iy[_i0][_i1] * Iy[_i0][_i1];
				}
				}

				for (int _i0 = 1; (_i0 - HEIGHT - 1 < 0); _i0++) {
				for (int _i1 = 1; (_i1 - WIDTH - 1 < 0); _i1++) {
				Ixy[_i0][_i1] = Ix[_i0][_i1] * Iy[_i0][_i1];
				}
				}

				for (int _i0 = 1; (_i0 - HEIGHT - 1 < 0); _i0++) {
				for (int _i1 = 1; (_i1 - WIDTH - 1 < 0); _i1++) {
				Ixx[_i0][_i1] = Ix[_i0][_i1] * Ix[_i0][_i1];
				}
				}

				for (int _i0 = 2; (_i0 < HEIGHT); _i0++) {
				for (int _i1 = 2; (_i1 < WIDTH); _i1++) {
				Syy[_i0][_i1] = ((((((((Iyy[-1 + _i0][-1 + _i1] +
				Iyy[-1 + _i0][_i1]) +
				Iyy[-1 + _i0][1 + _i1]) +
				Iyy[_i0][-1 + _i1]) +
				Iyy[_i0][_i1]) +
				Iyy[_i0][1 + _i1]) +
				Iyy[1 + _i0][-1 + _i1]) +
				Iyy[1 + _i0][_i1]) +
				Iyy[1 + _i0][1 + _i1]);
				}
				}

				for (int _i0 = 2; (_i0 < HEIGHT); _i0++) {
				for (int _i1 = 2; (_i1 < WIDTH); _i1++) {
				Sxy[_i0][_i1] = ((((((((Ixy[-1 + _i0][-1 + _i1] + Ixy[-1 + _i0][_i1]) +
				Ixy[-1 + _i0][1 + _i1]) +
				Ixy[_i0][-1 + _i1]) +
				Ixy[_i0][_i1]) +
				Ixy[_i0][1 + _i1]) +
				Ixy[1 + _i0][-1 + _i1]) +
				Ixy[1 + _i0][_i1]) +
				Ixy[1 + _i0][1 + _i1]);
				}
				}

				for (int _i0 = 2; (_i0 < HEIGHT); _i0++) {
				for (int _i1 = 2; (_i1 < WIDTH); _i1++) {
				Sxx[_i0][_i1] = ((((((((Ixx[-1 + _i0][-1 + _i1] + Ixx[-1 + _i0][_i1]) +
				Ixx[-1 + _i0][1 + _i1]) +
				Ixx[_i0][-1 + _i1]) +
				Ixx[_i0][_i1]) +
				Ixx[_i0][1 + _i1]) +
				Ixx[1 + _i0][-1 + _i1]) +
				Ixx[1 + _i0][_i1]) +
				Ixx[1 + _i0][1 + _i1]);
				}
				}

				for (int _i0 = 2; (_i0 < HEIGHT); _i0++) {
				for (int _i1 = 2; (_i1 < WIDTH); _i1++) {
				trace[_i0][_i1] = (Sxx[_i0][_i1] + Syy[_i0][_i1]);
				}
				}

				for (int _i0 = 2; (_i0 < HEIGHT); _i0++) {
				for (int _i1 = 2; (_i1 < WIDTH); _i1++) {
				det[_i0][_i1] = ((Sxx[_i0][_i1] * Syy[_i0][_i1]) - (Sxy[_i0][_i1] * Sxy[_i0][_i1]));
				}
				}

				for (int _i0 = 2; (_i0 < HEIGHT); _i0++) {
				for (int _i1 = 2; (_i1 < WIDTH); _i1++) {
				outputImg[_i0][_i1] = (det[_i0][_i1] - ((0.04f * trace[_i0][_i1]) * trace[_i0][_i1]));
				}
				}
				}

MicroBenchmarks/harris/main.cpp

				/* For polymage-benchmarks-harris kernel
				Copyright (c) 2015 Indian Institute of Science
				All rights reserved.

				Written and provided by:
				Ravi Teja Mullapudi, Vinay Vasista, Uday Bondhugula
				Dept of Computer Science and Automation
				Indian Institute of Science
				Bangalore 560012
				India

				Redistribution and use in source and binary forms, with or without
				modification, are permitted provided that the following conditions are met:

				1. Redistributions of source code must retain the above copyright
				notice, this list of conditions and the following disclaimer.

				2. Redistributions in binary form must reproduce the above copyright
				notice, this list of conditions and the following disclaimer in the
				documentation and/or other materials provided with the distribution.

				3. Neither the name of the Indian Institute of Science nor the
				names of its contributors may be used to endorse or promote products
				derived from this software without specific prior written permission.

				THIS MATERIAL IS PROVIDED BY Ravi Teja Mullapudi, Vinay Vasista, and Uday
				Bondhugula, Indian Institute of Science ''AS IS'' AND ANY EXPRESS OR IMPLIED
				WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
				MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
				EVENT SHALL Ravi Teja Mullapudi, Vinay Vasista, CSA Indian Institute of
				Science BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
				CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
				SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
				INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
				CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
				ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
				POSSIBILITY OF SUCH DAMAGE.
				*/

				// ============================================================================
				/*
				* Pankaj Kukreja
				* Indian Institute of Technology Hyderabad
				*
				* Acknowledgements
				// ============================================================================
				* HARRIS KERNEL from Polymage benchmark (modified)
				* File: polymage-benchmarks/apps/harris/harris_polymage_naive.cpp
				*/
				// ============================================================================

				#include "harris.h"
				int sum=0;


				#ifdef BENCHMARK_LIB
				#include "benchmark/benchmark.h"
				#endif


				// This function initializes the input image to checkbox image
				// Can be replaced with any other image initialization
				void initCheckboardImage(int height, int width, float image[(2 + HEIGHT)][2 + WIDTH]) {
				int last_pixel_x = 0;
				int last_pixel_y = 0;
				for (int i = 0; i < height; i++) {
				if (i % BOX_SIZE == 0) {
				last_pixel_y = (last_pixel_y + 1) % 2;
				}
				last_pixel_x = last_pixel_y;
				for (int j = 0; j < width; j++) {
				if (j % BOX_SIZE == 0) {
				last_pixel_x = (last_pixel_x + 1) % 2;
				}
				if (last_pixel_x == 0) {
				image[i][j] = 255;
				}
				else {
				image[i][j] = 0;
				}
				}
				}
				}

				// Writes image matrix to a file.
				void printImage(int height, int width, float arr[(2 + HEIGHT)][2 + WIDTH], int dummy) {
				std::ofstream myfile;
				myfile.open("output.txt");
				for (int i = 0; i < height - 2; i++) {
				for (int j = 0; j < width - 2; j++) {
				if(arr[i][j]<0) {
				myfile << 0;
				}
				else if(arr[i][j]>255) {
				myfile << 3;
				}
				else {
				myfile << (int)(arr[i][j]);
				}
				}
				myfile << "\n";
				}
				//Dummy code to make sure the allocated ImageOutput Array is not optimized out
				if(dummy > 0){
				myfile << sum;
				}
				}



				#ifdef BENCHMARK_LIB
				void BENCHMARK_HARRIS(benchmark::State &state) {
				float (*__restrict__ image)[HEIGHT + 2][WIDTH + 2];
				image = (float(*)[2+HEIGHT][2+WIDTH]) malloc(sizeof(float) * (2+HEIGHT) * (2+WIDTH));
				initCheckboardImage((HEIGHT + 2), (WIDTH + 2), *image);

				float (* __restrict__ imageOutput)[2 + HEIGHT][2 + WIDTH];
				imageOutput = (float(*)[2+HEIGHT][2+WIDTH]) malloc(sizeof(float) * (2+HEIGHT) * (2+WIDTH));

				float (* __restrict__ Ix) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Iy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Ixx) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Ixy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Iyy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Sxx) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Sxy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Syy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ det) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ trace)[2 + HEIGHT][2 + WIDTH];

				Ix = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Iy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Ixx = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Ixy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Iyy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Sxx = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Sxy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Syy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				det = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				trace = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));

				harrisKernel(HEIGHT, WIDTH
				, *image, *imageOutput
				, *Ix, *Iy
				, *Ixx, *Ixy, *Iyy
				, *Sxx, *Sxy, *Syy
				, *det, *trace);

				for (auto _ : state) {
				harrisKernel(HEIGHT, WIDTH
				, *image, *imageOutput
				, *Ix, *Iy
				, *Ixx, *Ixy, *Iyy
				, *Sxx, *Sxy, *Syy
				, *det, *trace);
				}

				// state.SetBytesProcessed(static_cast<int64_t>(state.iterations())*WIDTH*HEIGHT*50); // ??

				free((void*)Ix);
				free((void*)Iy);
				free((void*)Ixx);
				free((void*)Ixy);
				free((void*)Iyy);
				free((void*)Sxx);
				free((void*)Sxy);
				free((void*)Syy);
				free((void*)det);
				free((void*)trace);

				for(int i =0;i<HEIGHT+2;i++) {
				for(int j=0;j<WIDTH+2;j++) {
				sum = (sum+1) & (int)(*imageOutput)[i][j];
				}
				}
				free((void*)imageOutput);
				free((void*)image);
				}
				BENCHMARK(BENCHMARK_HARRIS)->Unit(benchmark::kMicrosecond);
				dberris (inline comment, Not Done): It seems that `HEIGHT` and `WIDTH` are input values anyway, so consider benchmarking multiple input sizes to see how the kernel performs as the image size goes up. You might also not need the `__restrict__` attributes for the malloc-provided heap memory. This means you could do:
					float *image = reinterpret_cast<float *>(malloc(sizeof(float) * (2 + state.range(0)) * (2 + state.range(1))));
				When you register the benchmark, you can then provide the image sizes to test with:
					BENCHMARK(HarrisBenchmark)
					    ->Unit(benchmark::kMicrosecond)
					    ->Args({256, 256})
					    ->Args({512, 512})
					    ->Args({1024, 1024})
					    ->Args({2048, 2048});
				You can see more options at https://github.com/google/benchmark#passing-arguments.
				Another thing you may consider measuring, as I suggested earlier, is throughput. To do that, you can call `state.SetBytesProcessed(...)` in the benchmark body, typically at the end just before exiting. You essentially want to report something like:
					state.SetBytesProcessed(sizeof(float) * (state.range(0) + 2) * (state.range(1) + 2) * state.iterations());
				This will add a "MB/sec" output alongside the time it took for each iteration of the benchmark.
				proton (author, inline comment, Not Done):
				- Cannot use `float *image`, as the pointers may overlap and this prevents Polly from detecting SCoPs.
				- I have to allocate the fixed-size arrays here because `float (&outputImg)[2+height][2+width] = reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...));` is not allowed by clang++.
				- I did consider adding `SetBytesProcessed`, but I was not sure how many bytes should be passed as the argument (the output image size or the total bytes accessed in the kernel), so I commented out the line `SetBytesProcessed(static_cast<int64_t>(state.iterations())*WIDTH*HEIGHT*50);` but forgot to ask about it.
				dberris (inline comment, Not Done): I don't know whether you want to optimise for Polly or make Polly just recognise that these pointers shouldn't overlap. If Polly can't detect that these pointers are coming from different `malloc` calls, then I suspect that's a bug in Polly rather than something you need to work around in the benchmark. Note that maybe the better thing to do is to change the kernel's API to put `restrict` or `__restrict__` on the pointers, so that the optimiser can assume that the pointers don't alias, and then not do anything special in this benchmark. See my top-level comment for alternatives to explore, if you're open to it.
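				The last suggestion in this thread, moving the no-alias promise into the kernel's own signature so that callers can pass plain heap pointers with runtime sizes, could look roughly like the sketch below. Everything here is an assumption made for illustration: `gradientX` and `runGradientX` are hypothetical names, the kernel computes a simple horizontal central difference rather than the Harris stages, and `__restrict__` is the GCC/Clang spelling of the qualifier. Whether Polly then detects the loop nest as a SCoP is not something this sketch establishes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: the no-alias promise lives in the signature, so
// callers do not need __restrict__ on their own variables.
void gradientX(std::size_t height, std::size_t width,
               const float *__restrict__ in, float *__restrict__ out) {
  assert(in != out); // the restrict contract requires distinct buffers
  const std::size_t stride = width + 2; // row length including the border
  for (std::size_t i = 1; i <= height; ++i)
    for (std::size_t j = 1; j <= width; ++j)
      out[i * stride + j] = in[i * stride + j + 1] - in[i * stride + j - 1];
}

// Driver with sizes chosen at run time; buffers are flat and heap-allocated.
std::vector<float> runGradientX(std::size_t height, std::size_t width) {
  const std::size_t stride = width + 2;
  std::vector<float> in((height + 2) * stride);
  std::vector<float> out(in.size(), 0.0f);
  for (std::size_t i = 0; i < height + 2; ++i)
    for (std::size_t j = 0; j < stride; ++j)
      in[i * stride + j] = static_cast<float>(j); // horizontal ramp
  gradientX(height, width, in.data(), out.data());
  return out;
}
```

				With the sizes passed as ordinary arguments, the same function can be called for each size registered with the benchmark, at the cost of manual index arithmetic in place of `[i][j]` subscripts.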
				#endif


				int main(int argc, char *argv[]) {
				sum =1;
				#ifdef BENCHMARK_LIB
				::benchmark::Initialize(&argc, argv);
				dberris (inline comment, Done): Can you re-format this? Preferably with clang-format if possible, so that it's easier to read.
				if (::benchmark::ReportUnrecognizedArguments(argc, argv))
				return 1;
				::benchmark::RunSpecifiedBenchmarks();
				#endif

				// Extra Call to verify output of kernel
				float (*__restrict__ image)[HEIGHT + 2][WIDTH + 2];
				image = (float(*)[2+HEIGHT][2+WIDTH]) malloc(sizeof(float) * (2+HEIGHT) * (2+WIDTH));
				initCheckboardImage((HEIGHT + 2), (WIDTH + 2), *image);

				float (* __restrict__ imageOutput)[2 + HEIGHT][2 + WIDTH];
				imageOutput = (float(*)[2+HEIGHT][2+WIDTH]) malloc(sizeof(float) * (2+HEIGHT) * (2+WIDTH));


				float (* __restrict__ Ix) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Iy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Ixx) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Ixy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Iyy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Sxx) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Sxy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ Syy) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ det) [2 + HEIGHT][2 + WIDTH];
				float (* __restrict__ trace)[2 + HEIGHT][2 + WIDTH];

				Ix = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Iy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Ixx = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Ixy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Iyy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Sxx = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Sxy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				Syy = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				det = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));
				trace = (float(* __restrict__) [2+HEIGHT][2+WIDTH]) malloc(sizeof(float)* (2+HEIGHT) * (2+WIDTH));

				harrisKernel(HEIGHT, WIDTH
				, *image, *imageOutput
				, *Ix, *Iy
				, *Ixx, *Ixy, *Iyy
				, *Sxx, *Sxy, *Syy
				, *det, *trace);

				free((void*)Ix);
				free((void*)Iy);
				free((void*)Ixx);
				free((void*)Ixy);
				free((void*)Iyy);
				free((void*)Sxx);
				free((void*)Sxy);
				free((void*)Syy);
				free((void*)det);
				free((void*)trace);

				if(argc==2) {
				printImage(HEIGHT + 2, WIDTH + 2, *imageOutput, sum);
				}
				else {
				printImage(HEIGHT + 2, WIDTH + 2, *imageOutput, -1);
				}

				free((void*)image);
				free((void*)imageOutput);
				return 0;
				}


Revision Contents

Diff 151811

MicroBenchmarks/CMakeLists.txt

MicroBenchmarks/harris/CMakeLists.txt

MicroBenchmarks/harris/harris.h

MicroBenchmarks/harris/harris.reference_output

MicroBenchmarks/harris/harrisKernel.cpp

MicroBenchmarks/harris/main.cpp
