This is an archive of the discontinued LLVM Phabricator instance.

Differential D47675

[test-suite][RFC] Using Google Benchmark Library on Harris Kernel
ClosedPublic

Authored by proton on Jun 2 2018, 5:06 AM.

Details

Reviewers

dberris
Meinersbur
homerdin
MatzeB
hfinkel
cmatthews
kristof.beyls

Commits

rOLDT335611: [test-suite] Using Google Benchmark Library on Harris Kernel
rL335611: [test-suite] Using Google Benchmark Library on Harris Kernel

Summary

Hi,
I have used the Google Benchmark library on the Harris corner detection kernel (from the PolyMage benchmarks). I would like to know whether the way I have used the library here is correct, or whether there is a better way to do this.

Diff Detail

Repository: rOLDT svn-test-suite

Event Timeline

proton created this revision.Jun 2 2018, 5:06 AM

Herald added subscribers: llvm-commits, mgorny. Jun 2 2018, 5:06 AM

Did you consider putting the driver code (initialization, malloc, free, checking the result) into a different file than the kernel code? This would allow measuring the kernel (code size, LLVM statistics, compile time, etc.) independently from the boilerplate code.

MicroBenchmarks/harris/harris.cpp
177–197	Maybe the malloc/free calls should be taken out of the measured kernel.
201–208	Could you rewrite this to use multidimensional access subscripts? E.g. `img[_i0-1][_i1-1]`.
MicroBenchmarks/harris/sha1.hpp
1–45	There is already hashing used by test-suite (see `HashProgramOutput.sh`), why add another one?
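
A minimal sketch of the multidimensional-subscript suggestion, assuming the padded row width is a compile-time constant. The names `kCols`, `iy_flat` and `iy_2d` are hypothetical and not part of the patch; both functions compute the same `Iy` stencil term as the kernel.

```cpp
#include <cassert>

constexpr int kCols = 5; // hypothetical interior width; rows are padded to kCols + 2

// Flat indexing, in the style of the submitted kernel.
float iy_flat(float *img, int i, int j) {
  assert(i >= 1 && j >= 1);
  const int stride = kCols + 2;
  return img[(i - 1) * stride + (j - 1)] * -0.0833333333333f +
         img[(i - 1) * stride + (j + 1)] * 0.0833333333333f +
         img[i * stride + (j - 1)] * -0.166666666667f +
         img[i * stride + (j + 1)] * 0.166666666667f +
         img[(i + 1) * stride + (j - 1)] * -0.0833333333333f +
         img[(i + 1) * stride + (j + 1)] * 0.0833333333333f;
}

// The same stencil written with two-dimensional subscripts.
float iy_2d(float (*img)[kCols + 2], int i, int j) {
  assert(i >= 1 && j >= 1);
  return img[i - 1][j - 1] * -0.0833333333333f +
         img[i - 1][j + 1] * 0.0833333333333f +
         img[i][j - 1] * -0.166666666667f +
         img[i][j + 1] * 0.166666666667f +
         img[i + 1][j - 1] * -0.0833333333333f +
         img[i + 1][j + 1] * 0.0833333333333f;
}
```

The second form keeps the row and column offsets visible in the subscripts, which is what the review comment asks for.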

proton added reviewers: MatzeB, hfinkel, cmatthews, kristof.beyls.Jun 5 2018, 12:14 PM

proton updated this revision to Diff 150032.Jun 5 2018, 1:16 PM

proton updated this revision to Diff 150815.Jun 11 2018, 12:40 PM

proton edited the summary of this revision. (Show Details)

Looks great. Did you do a performance comparison with/without Polly?

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

[comment] Is there a reason why init.cpp and main.cpp are separate files?

MicroBenchmarks/harris/harris.h
33 ↗	(On Diff #150815)	[style] `DUMP_IMAGE` is named like a macro, but is a function.
MicroBenchmarks/harris/harris_kernel.cpp
118 ↗	(On Diff #150815)	[style] `return` before the end of the function seems unnecessary.

proton updated this revision to Diff 151105.Jun 13 2018, 12:11 AM

In D47675#1129087, @Meinersbur wrote:

Looks great. Did you do a performance comparison with/without Polly?

Polly + O3 and O3 alone take the same time. It seems that by the time the code reaches Polly, it is already heavily optimized at O3, and Polly cannot find any further optimization even though the code is within a SCoP.

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

Updated.
I still have to keep an extra global array (other than the image) to copy the final output to. I couldn't modify the original image array as it will affect the input of other threads.

[comment] Is there a reason why init.cpp and main.cpp are separate files?

init.cpp has the image initialization and print functions, which can be used by other image processing kernels as well, so I kept it in a separate file.

In D47675#1130720, @proton wrote:

In D47675#1129087, @Meinersbur wrote:

Looks great. Did you do a performance comparison with/without Polly?

Polly + O3 and O3 alone take the same time. It seems that by the time the code reaches Polly, it is already heavily optimized at O3, and Polly cannot find any further optimization even though the code is within a SCoP.

How long is a single run with O3?

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

Updated.
I still have to keep an extra global array (other than the image) to copy the final output to. I couldn't modify the original image array as it will affect the input of other threads.

It still writes to the target array in parallel without locking.

I suggest calling harrisKernel once more, only for the correctness check.
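
One way to read this suggestion, sketched with a hypothetical stand-in kernel (`toy_kernel`) and hypothetical helper names: the timed body writes only to a buffer it owns, and a single extra, untimed call produces the output that is checked. This is an illustration of the idea, not the code in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real kernel: out[i] = 2 * in[i].
void toy_kernel(const std::vector<float> &in, std::vector<float> &out) {
  assert(in.size() == out.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = 2.0f * in[i];
}

// What each benchmark invocation would do: read the shared input,
// write to a private output buffer, so concurrent runs never collide.
void timed_body(const std::vector<float> &input, int iterations) {
  std::vector<float> local_out(input.size());
  for (int it = 0; it < iterations; ++it)
    toy_kernel(input, local_out);
}

// One extra, untimed call whose result is used only for verification.
std::vector<float> reference_run(const std::vector<float> &input) {
  std::vector<float> out(input.size());
  toy_kernel(input, out);
  return out;
}
```

With this split, no global output array is needed, and the verification result does not depend on which timed iteration happened to finish last.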

[comment] Is there a reason why init.cpp and main.cpp are separate files?

init.cpp has the image initialization and print functions, which can be used by other image processing kernels as well, so I kept it in a separate file.

If it is supposed to be a shared resource, it shouldn't be in the harris directory.

Could you check whether llvm-lit correctly collects execution time, compile/link time, LLVM -stats, and code size?

Updated input size, used malloc to allocate memory for the array.

In D47675#1131201, @Meinersbur wrote:

Could you check whether llvm-lit correctly collects execution time, compile/link time, LLVM -stats, code size?

I don't know how to check LLVM -stats using lit.
Sizes match the output of llvm-size; compile time and link time are also fine.

lit Output: used "lit ." in build/Microbenchmark/harris
compile_time: 1.3595
link_time: 0.0832
exec size: 364288
exec_time: 28254000.0000
Testing Time: 1.22s
.
.
.

For now, I have merged init.cpp and main.cpp. The image initialization here is specialized to visualize the output nicely. We may or may not use this initialization for other image processing kernels.
If a common image initialization source is introduced in the future (maybe with multiple types of image initialization and some helper functions for image processing), only minor changes to main.cpp will be needed.

Here are some of the stats for this code:

RunTime

With Benchmark library

Flag	CPU Time	Iteration	System measured (using time ./a.out)
`O0`	117259447ns	6	1.039s
`O3`	32498956ns	21	1.214s
`O3+Polly`	32562202ns	21	1.205s

Without Benchmark library (input size is changed - refer harris.h)

Flag	System measured time
`O0`	1.241s
`O3`	0.697s
`O3+Polly`	0.611s

Note: Clang is built in debug mode and I have compiled benchmark using clang++ -O3 -mllvm -polly --std=c++11 harrisKernel.cpp main.cpp -lbenchmark -lpthread -o withPolly.out

lit --vg --vg-leak

It fails, but so do the other benchmarks in the MicroBenchmarks folder, so maybe there is a memory leak problem in the benchmark library.

I am thinking of verifying the output manually, since the output follows a pattern with this checkerboard initialization; let me know whether that is a good idea.
I have removed the reference output for now, as it is 10MB in size. I will update the diff with the reference output once the checking method is finalized.
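
As an illustration of the kind of pattern check being proposed, the sketch below verifies the checkerboard input against a closed-form rule. The rule for the kernel output is not stated in this thread, so the output check itself is left out; `make_checkerboard` follows the loop in `init_checkboard_image`, while `matches_pattern` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Incremental initialization, following init_checkboard_image in the patch.
std::vector<float> make_checkerboard(int height, int width, int box) {
  assert(height > 0 && width > 0 && box > 0);
  std::vector<float> img(static_cast<std::size_t>(height) * width);
  int last_y = 0;
  for (int i = 0; i < height; ++i) {
    if (i % box == 0)
      last_y = (last_y + 1) % 2;
    int last_x = last_y;
    for (int j = 0; j < width; ++j) {
      if (j % box == 0)
        last_x = (last_x + 1) % 2;
      img[static_cast<std::size_t>(i) * width + j] =
          (last_x == 0) ? 255.0f : 0.0f;
    }
  }
  return img;
}

// Closed-form rule: a pixel is bright when its box-row index and
// box-column index have an even sum.
bool matches_pattern(const std::vector<float> &img, int height, int width,
                     int box) {
  for (int i = 0; i < height; ++i)
    for (int j = 0; j < width; ++j) {
      float expected = ((i / box + j / box) % 2 == 0) ? 255.0f : 0.0f;
      if (img[static_cast<std::size_t>(i) * width + j] != expected)
        return false;
    }
  return true;
}
```

A rule of this form avoids storing a large reference file, at the cost of having to derive and maintain the expected pattern by hand.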

dberris added inline comments.Jun 13 2018, 7:07 PM

MicroBenchmarks/harris/main.cpp
107–120 ↗	(On Diff #151280)	There are a few comments I have about this code, but let me start with the simple(r) ones:

You may want to make the image sizes configurable, and use the benchmark API to try it on differently sized images (so that you can see how the algorithm scales based on input sizes).

You probably want to ensure that you do something with the image data after the loop(s) to ensure that the allocations aren't optimised away. You can use the benchmark::DoNotOptimize(...) function to do some of that.

You might also want to consider measuring throughput (computing the amount of data processed for the time it took per iteration).

Before you get into the loop, you should probably run the kernel once on the just-allocated memory, to ensure that you're not just measuring the cost of pulling data through the cache(s). This is probably better to do with smaller images.
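
A library-free sketch of two of these points, the warm-up run and consuming the output after the loop. It uses `std::chrono` instead of Google Benchmark so that it stands alone; `double_pixels`, `TimingResult` and `time_kernel` are hypothetical names, and the toy kernel is only a placeholder for the Harris kernel.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel: out[i] = 2 * in[i].
void double_pixels(const std::vector<float> &in, std::vector<float> &out) {
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = 2.0f * in[i];
}

struct TimingResult {
  double seconds_per_iteration;
  float checksum; // derived from the output so the work stays observable
};

TimingResult time_kernel(std::size_t height, std::size_t width,
                         int iterations) {
  assert(iterations > 0);
  std::vector<float> in((height + 2) * (width + 2), 1.0f);
  std::vector<float> out(in.size(), 0.0f);

  // Warm-up: touch the freshly allocated memory once before timing.
  double_pixels(in, out);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i)
    double_pixels(in, out);
  auto stop = std::chrono::steady_clock::now();

  // Read the output after the loop so it cannot be treated as dead.
  float checksum = 0.0f;
  for (float v : out)
    checksum += v;

  double seconds = std::chrono::duration<double>(stop - start).count();
  return {seconds / iterations, checksum};
}
```

In the actual benchmark, the library's timing loop and `benchmark::DoNotOptimize` would take the place of the manual clock and the checksum.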

Using Polly's -debug-only=polly-scops output showed that the kernel is in fact not optimized:

Invalidate SCoP because of reason 0

NOTE: Run time checks for %for.cond11.preheader---%for.cond.cleanup532 could not be created as the number of parameters involved is too high. The SCoP will be dismissed.
Use:
        --polly-rtc-max-parameters=X
to adjust the maximal number of parameters but be advised that the compile time might increase exponentially.

Bailing-out because could not build alias checks

Using -polly-rtc-max-parameters=999 does not help. I remember that Polly prints that message whenever it cannot build the alias checks, even for other reasons than stated.

Runtime on this differential

Benchmark	Time	CPU	Iterations
Polly	28825	28757	25
Polly	28964	28910	24
-O3	13216	13191	54
-O3	13072	13049	54
-O0	98790	98716	7
-O0	99720	99652	7

Modified such that polly is now able to make some changes in the kernel (no runtime checks problem).

Did you check whether Polly recognizes the call to exp as part of the SCoP?

dberris added inline comments.Jun 19 2018, 10:29 PM

MicroBenchmarks/harris/main.cpp
179 ↗	(On Diff #151811)	It seems that HEIGHT and WIDTH are input values anyway; consider using multiple input sizes to see how the kernel performs as the image size goes up. You might also not need the `__restrict__` attributes for the malloc-provided heap memory either. This means you could do:

float *image = reinterpret_cast<float *>(malloc(sizeof(float) * (2 + state.range(0)) * (2 + state.range(1))));

When you register the benchmark, you can then provide the image sizes to test with:

BENCHMARK(HarrisBenchmark)
    ->Unit(benchmark::kMicrosecond)
    ->Args({256, 256})
    ->Args({512, 512})
    ->Args({1024, 1024})
    ->Args({2048, 2048});

You can see more options at https://github.com/google/benchmark#passing-arguments.

Another thing you may consider measuring, as I suggested in the past, is throughput. To do that, you can call `state.SetBytesProcessed(...)` in the benchmark body, typically at the end just before exiting -- you want to essentially report something like:

state.SetBytesProcessed(sizeof(float) * (state.range(0) + 2) * (state.range(1) + 2) * state.iterations());

This will add a "MB/sec" output alongside the time it took for each iteration of the benchmark.

proton updated this revision to Diff 152211.Jun 20 2018, 6:35 PM

proton marked an inline comment as done.

Thanks for making some of the changes. I'm still not clear on a couple of things.

Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?

MicroBenchmarks/harris/main.cpp

111–119 ↗

(On Diff #152211)

Why are these still using HEIGHT and WIDTH? Why aren't these just:

const size_t height = state.range(0);
const size_t width = state.range(1);

float **image = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + height) * (2 + width)));
float **imageOutput = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + height) * (2 + width)));

185 ↗

(On Diff #152211)

Can you re-format this? Preferably with clang-format if possible, so that it's easier to read.

Formatted using clang-format.

Results

With Polly:

BENCHMARK	Time	CPU	Iteration	Throughput
BENCHMARK_HARRIS/256/256	769 us	768 us	911	330.508MB/s
BENCHMARK_HARRIS/512/512	4001 us	3996 us	177	252.207MB/s
BENCHMARK_HARRIS/1024/1024	25690 us	25650 us	28	156.553MB/s
BENCHMARK_HARRIS/2048/2048	118023 us	117830 us	6	136.054MB/s

Without Polly:

BENCHMARK	Time	CPU	Iteration	Throughput
BENCHMARK_HARRIS/256/256	626 us	625 us	1135	406.15MB/s
BENCHMARK_HARRIS/512/512	3074 us	3068 us	229	328.51MB/s
BENCHMARK_HARRIS/1024/1024	17121 us	17086 us	42	235.022MB/s
BENCHMARK_HARRIS/2048/2048	64207 us	64077 us	11	250.189MB/s
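
The throughput column can be cross-checked from the CPU time column. The sketch below assumes that one padded float image of `(height + 2) * (width + 2)` elements is counted per iteration and that the reported unit is MiB/s; both are inferences from the numbers, not something stated in this thread, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes counted per iteration, assuming one padded input image is reported.
std::int64_t bytes_per_iteration(std::int64_t height, std::int64_t width) {
  assert(height > 0 && width > 0);
  return static_cast<std::int64_t>(sizeof(float)) * (height + 2) * (width + 2);
}

// Throughput in MiB/s, given the CPU time per iteration in microseconds.
double mib_per_second(std::int64_t bytes, double cpu_time_us) {
  assert(cpu_time_us > 0.0);
  return static_cast<double>(bytes) / (cpu_time_us * 1e-6) / (1024.0 * 1024.0);
}
```

For the 256x256 run with Polly (768 us of CPU time), this gives roughly 330.6 MiB/s, close to the reported 330.508MB/s; the small difference is consistent with the CPU time being rounded to whole microseconds in the table.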

In D47675#1138707, @dberris wrote:

Thanks for making some of the changes. I'm still not clear on a couple of things.

Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?

I replied to your previous comment but for some reason, it is showing only after your inlined comment not here.

I cannot use a pointer to pointer of an array (float **) here as the compiler may think that some pointers may overlap and prevents Polly from detecting SCoPs here.

I have to allocate the fixed size arrays here as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++

Also, is the number of bytes processed calculated with respect to the size of the output, or the total number of bytes accessed in the kernel?

MicroBenchmarks/harris/main.cpp
179 ↗	(On Diff #151811)	Cannot use float *image, as the pointers may overlap and this prevents Polly from detecting SCoPs. I have to allocate the fixed-size arrays here, as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...));" is not allowed by clang++. I did consider adding SetBytesProcessed, but I was not sure how many bytes should be passed as the argument (output image size or the total bytes accessed in the kernel), so I commented out the line "SetBytesProcessed(static_cast<int64_t>(state.iterations())*WIDTH*HEIGHT*50);" but forgot to ask about it.

In D47675#1139805, @proton wrote:

In D47675#1138707, @dberris wrote:

Thanks for making some of the changes. I'm still not clear on a couple of things.

Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?

I replied to your previous comment but for some reason, it is showing only after your inlined comment not here.

I cannot use a pointer to pointer of an array (float **) here as the compiler may think that some pointers may overlap and prevents Polly from detecting SCoPs here.

I'm not sure that's entirely true -- the pointer is coming from malloc, so they're meant to not overlap. At least LLVM should be able to detect that.

I have to allocate the fixed size arrays here as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++

It's not allowed because height and width are not constant expressions. Note that even in function arguments, the "array" form will be treated as pointers anyway so it shouldn't make any difference if you change the API to:

void harrisKernel(
    int height, int width, float **inputImg,
    float **outputImg, float **Ix,
    float **Iy, float **Ixx,
    float **Ixy, float **Iyy,
    float **Sxx, float **Sxy,
    float **Syy, float **det,
    float **trace)

If anything, you want to mark those pointers that the compiler is supposed to treat as non-aliasing as restrict or __restrict__.
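
A small sketch of the two points above, with hypothetical function names. The `static_assert`s show how array-typed parameters are adjusted to pointers; note that a two-dimensional array parameter becomes a pointer to a row (`float (*)[8]`), not a `float **`. The last function shows `__restrict__`, which is a GCC/Clang extension rather than standard C++.

```cpp
#include <cassert>
#include <type_traits>

void takes_1d(float a[8]);    // parameter type is adjusted to float *
void takes_2d(float a[4][8]); // parameter type is adjusted to float (*)[8]

static_assert(std::is_same<decltype(takes_1d), void(float *)>::value,
              "1-D array parameter is a plain pointer");
static_assert(std::is_same<decltype(takes_2d), void(float (*)[8])>::value,
              "2-D array parameter is a pointer to a row");

// Promises the compiler that `in` and `out` do not overlap.
void scale(const float *__restrict__ in, float *__restrict__ out, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; ++i)
    out[i] = 2.0f * in[i];
}
```

Calling `scale` with overlapping buffers would break the promise made by `__restrict__`, so the qualifier belongs on the kernel's API only if callers can guarantee distinct buffers.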

Also, is the number of bytes processed calculated with respect to the size of the output, or the total number of bytes accessed in the kernel?

The bytes processed is what you say it is -- as I suggested, it is based on the input size (size of the input image). If you want to measure something else, you're going to have to provide what that throughput is.

The beauty of having a benchmark suite is that you can test out these various approaches in the same benchmark. You can explore alternative strategies like:

Turn the harris kernel implementation into a template, so you can use C++ features and take const std::array<std::array<float, W>, H>&, and let the compiler deduce the sizes.
Have a version of the kernel with pointer arguments with __restrict__ and one without.
Instead of taking output pointers, consider returning a struct/tuple with std::unique_ptr<float[][]> members.
Use C++ array new and array delete (say new float*[(height *2) * (width * 2)]).
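
The first alternative could look roughly like the sketch below, where the image dimensions are template parameters deduced from the argument types. `Image` and `box_sum` are hypothetical names, and the 3x3 box sum stands in for one stage of the Harris kernel rather than reproducing it.

```cpp
#include <array>
#include <cstddef>

template <std::size_t H, std::size_t W>
using Image = std::array<std::array<float, W>, H>;

// 3x3 box sum over the interior pixels; H and W are deduced by the compiler.
template <std::size_t H, std::size_t W>
void box_sum(const Image<H, W> &in, Image<H, W> &out) {
  static_assert(H >= 3 && W >= 3, "needs at least one interior pixel");
  for (std::size_t i = 1; i + 1 < H; ++i) {
    for (std::size_t j = 1; j + 1 < W; ++j) {
      float sum = 0.0f;
      for (std::size_t di = 0; di < 3; ++di)
        for (std::size_t dj = 0; dj < 3; ++dj)
          sum += in[i - 1 + di][j - 1 + dj];
      out[i][j] = sum;
    }
  }
}
```

Because the sizes are compile-time constants, each image size needs its own instantiation, and large images would have to live on the heap or in static storage rather than on the stack.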

Anyway, I'm fine with the state of it currently, but would like to see more work done if not now but in the future. This will be really helpful for people working on the compiler trying to see whether performance/throughput can be improved (or regress) with changes to the compiler. I'm sure the folks working on Polly would like to be able to diagnose why the Polly version is slower than the normal build.

Please wait for someone else to LGTM/Accept before landing. I'm sure we can spend a lot more time trying to make these benchmarks better, but in the meantime having *something* in there is better than not having one in there.

MicroBenchmarks/harris/main.cpp
179 ↗	(On Diff #151811)	I don't know whether you want to optimise for Polly or make Polly just recognise these pointers shouldn't overlap. If Polly can't detect that these pointers are coming from different 'malloc' calls, then I suspect that's a bug in Polly rather than something you need to work around in the benchmark. Note that maybe the better thing to do is to change the kernel's API to put `restrict` or `__restrict__` on the pointers, so that the optimiser in those cases might be able to assume that the pointers don't alias and don't do anything special in this benchmark. See my top-level comment for alternatives to explore, if you're open to it.

This revision is now accepted and ready to land.Jun 21 2018, 4:37 PM

In D47675#1140100, @dberris wrote:

Please wait for someone else to LGTM/Accept before landing. I'm sure we can spend a lot more time trying to make these benchmarks better, but in the meantime having *something* in there is better than not having one in there.

I agree with @dberris. Once this has landed, we can look into adding more such benchmarks and base patches on common structures.

homerdin added inline comments.Jun 25 2018, 11:43 AM

MicroBenchmarks/harris/main.cpp
87 ↗	(On Diff #152363)	This will write `output.txt` into whichever directory lit is run in. You can set the working directory that the test will run in by passing an argument to `llvm_test_run()`, e.g. `llvm_test_run(WORKDIR ${CMAKE_CURRENT_BINARY_DIR})`.

Added LICENSE and fixed the working directory issue.

Closed by commit rL335611: [test-suite] Using Google Benchmark Library on Harris Kernel (authored by homerdin). Jun 26 2018, 8:12 AM

This revision was automatically updated to reflect the committed changes.

Meinersbur mentioned this in D101844: [MicroBenchmarks] Add initial loop vectorization benchmarks..May 11 2021, 9:02 AM

Diff 149608

MicroBenchmarks/CMakeLists.txt

				add_subdirectory(libs)
				add_subdirectory(XRay)
				add_subdirectory(LCALS)
				add_subdirectory(harris)

MicroBenchmarks/harris/CMakeLists.txt

				list(APPEND CPPFLAGS -std=c++11 )
				llvm_test_run()

				llvm_test_executable(harris harris.cpp sha1.cpp)
				target_link_libraries(harris benchmark)

MicroBenchmarks/harris/harris.cpp

				/* For polymage-benchmarks-harris kernel
				Copyright (c) 2015 Indian Institute of Science
				All rights reserved.

				Written and provided by:
				Ravi Teja Mullapudi, Vinay Vasista, Uday Bondhugula
				Dept of Computer Science and Automation
				Indian Institute of Science
				Bangalore 560012
				India

				Redistribution and use in source and binary forms, with or without
				modification, are permitted provided that the following conditions are met:

				1. Redistributions of source code must retain the above copyright
				notice, this list of conditions and the following disclaimer.

				2. Redistributions in binary form must reproduce the above copyright
				notice, this list of conditions and the following disclaimer in the
				documentation and/or other materials provided with the distribution.

				3. Neither the name of the Indian Institute of Science nor the
				names of its contributors may be used to endorse or promote products
				derived from this software without specific prior written permission.

				THIS MATERIAL IS PROVIDED BY Ravi Teja Mullapudi, Vinay Vasista, and Uday
				Bondhugula, Indian Institute of Science ''AS IS'' AND ANY EXPRESS OR IMPLIED
				WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
				MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
				EVENT SHALL Ravi Teja Mullapudi, Vinay Vasista, CSA Indian Institute of
				Science BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
				CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
				SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
				INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
				CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
				ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
				POSSIBILITY OF SUCH DAMAGE.
				*/

				// ============================================================================
				/*
				* Pankaj Kukreja
				* Indian Institute of Technology Hyderabad
				*
				* Acknowledgements
				// ============================================================================
				* SHA1SUM computation source code from:
				* https://github.com/vog/sha1
				* 100% Public Domain
				// ============================================================================
				* HARRIS KERNEL (modified)
				* https://github.com/bondhugula/polymage-benchmarks/
				* File: polymage-benchmarks/apps/harris/harris_polymage_naive.cpp
				*/
				// ============================================================================

				#include "sha1.hpp"
				#include <cstring>
				#include <fstream>
				#include <iomanip>
				#include <iostream>
				#include <string>

				// ============================================================================
				// ============================================================================

				// Image Size
				#define HEIGHT 100
				#define WIDTH 100
				// (Any box size will work)
				// This parameter is used in input
				#define BOX_SIZE 10

				// ============================================================================
				// ============================================================================

				// Some Parameters
				#define THRESHOLD 0.1 /* For harris kernel */

				// ============================================================================
				// ============================================================================

				#define BENCHMARK_LIB /* Comment this to not use google benchmark library*/
				#ifdef BENCHMARK_LIB
				#include "benchmark/benchmark.h"
				#endif

				// ============================================================================

				void init_checkboard_image(int height, int width);
				void write_to_file(int height, int width, float *array);
				int check_output();
				void harris(float *imag);

				// ============================================================================
				using namespace std;
				float *__restrict__ image;

				// This function initializes the input image to checkbox image
				// Can be replaced with any other image initialization
				void init_checkboard_image(int height, int width) {
				image = (float *)malloc((HEIGHT + 2) * (WIDTH + 2) * sizeof(float));
				// 2 different x and y, as we want alternate in both direction
				int last_pixel_x = 0;
				int last_pixel_y = 0;

				// Initialize a checkerboard image
				for (int i = 0; i < height; i++) {
				if (i % BOX_SIZE == 0) {
				last_pixel_y = (last_pixel_y + 1) % 2;
				}
				last_pixel_x = last_pixel_y;
				for (int j = 0; j < width; j++) {
				if (j % BOX_SIZE == 0) {
				last_pixel_x = (last_pixel_x + 1) % 2;
				}
				if (last_pixel_x == 0)
				image[i * width + j] = 255;
				else
				image[i * width + j] = 0;
				}
				}
				}

				// Writes the image matrix to a file (assuming values are only 0, 255 and 3).
				// If other values are present, remove the % 8.
				void write_to_file(int height, int width, float *array) {
				ofstream myfile;
				myfile.open("output.txt");

				for (int i = 0; i < height - 2; i++) {
				for (int j = 0; j < width - 2; j++) {
				myfile << int(array[i * (width) + j]) % 8;
				}
				myfile << "\n";
				}
				}

				/*

				This function is called in the benchmark macro

				-> Since the kernel modifies the original image array, I had to copy values to a
				temporary array and then pass that array and then write the output to file. If
				there is any better way to do this let me know.

				-> If I do not copy the values and directly pass image array, then some thread
				may read the modified value of image array instead of original value.
				*/

				#ifdef BENCHMARK_LIB
				void BENCHMARK_HARRIS(benchmark::State &state) {
				int R = HEIGHT;
				int C = WIDTH;
				float *img;
				img = (float *)(malloc(
				(sizeof(float) * ((2 + R) * (2 + C))))); // for first free
				for (auto _ : state) {
				state.PauseTiming();
				free(img);
				img = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				std::memcpy(img, image, (2 + R) * (2 + C) * sizeof(float));
				state.ResumeTiming();
				harris(img);
				}
				write_to_file(HEIGHT + 2, WIDTH + 2, img);
				free(img);
				}
				BENCHMARK(BENCHMARK_HARRIS);
				#endif

				// harris kernel from polymage_naive.cpp
				void harris(float *__restrict__ imag) {
				int C = WIDTH;
				int R = HEIGHT;
				float *img;
				img = (float *)imag;
				float *Ix;
				Ix = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				float *Iy;
				Iy = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				float *Ixx;
				Ixx = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				float *Ixy;
				Ixy = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				float *Iyy;
				Iyy = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				float *Sxx;
				Sxx = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				float *Sxy;
				Sxy = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				float *Syy;
				Syy = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				float *det;
				det = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));
				float *trace;
				trace = (float *)(malloc((sizeof(float) * ((2 + R) * (2 + C)))));

				for (int _i0 = 1; (_i0 <= R); _i0 = (_i0 + 1)) {
				for (int _i1 = 1; (_i1 <= C); _i1 = (_i1 + 1)) {
				Iy[((_i0 * (2 + C)) + _i1)] =
				((((((img[(((-1 + _i0) * (C + 2)) + (-1 + _i1))] *
				-0.0833333333333f) +
				(img[(((-1 + _i0) * (C + 2)) + (1 + _i1))] * 0.0833333333333f)) +
				(img[((_i0 * (C + 2)) + (-1 + _i1))] * -0.166666666667f)) +
				(img[((_i0 * (C + 2)) + (1 + _i1))] * 0.166666666667f)) +
				(img[(((1 + _i0) * (C + 2)) + (-1 + _i1))] * -0.0833333333333f)) +
				(img[(((1 + _i0) * (C + 2)) + (1 + _i1))] * 0.0833333333333f));
				}
				}
				for (int _i0 = 1; (_i0 <= R); _i0 = (_i0 + 1)) {
				for (int _i1 = 1; (_i1 <= C); _i1 = (_i1 + 1)) {
				Ix[((_i0 * (2 + C)) + _i1)] =
				((((((img[(((-1 + _i0) * (C + 2)) + (-1 + _i1))] *
				-0.0833333333333f) +
				(img[(((1 + _i0) * (C + 2)) + (-1 + _i1))] * 0.0833333333333f)) +
				(img[(((-1 + _i0) * (C + 2)) + _i1)] * -0.166666666667f)) +
				(img[(((1 + _i0) * (C + 2)) + _i1)] * 0.166666666667f)) +
				(img[(((-1 + _i0) * (C + 2)) + (1 + _i1))] * -0.0833333333333f)) +
				(img[(((1 + _i0) * (C + 2)) + (1 + _i1))] * 0.0833333333333f));
				}
				}

				for (int _i0 = 1; (_i0 <= R); _i0 = (_i0 + 1)) {
				for (int _i1 = 1; (_i1 <= C); _i1 = (_i1 + 1)) {
				Iyy[((_i0 * (2 + C)) + _i1)] =
				(Iy[((_i0 * (2 + C)) + _i1)] * Iy[((_i0 * (2 + C)) + _i1)]);
				}
				}
				for (int _i0 = 1; (_i0 <= R); _i0 = (_i0 + 1)) {
				for (int _i1 = 1; (_i1 <= C); _i1 = (_i1 + 1)) {
				Ixy[((_i0 * (2 + C)) + _i1)] =
				(Ix[((_i0 * (2 + C)) + _i1)] * Iy[((_i0 * (2 + C)) + _i1)]);
				}
				}

				for (int _i0 = 1; (_i0 <= R); _i0 = (_i0 + 1)) {
				for (int _i1 = 1; (_i1 <= C); _i1 = (_i1 + 1)) {
				Ixx[((_i0 * (2 + C)) + _i1)] =
				(Ix[((_i0 * (2 + C)) + _i1)] * Ix[((_i0 * (2 + C)) + _i1)]);
				}
				}
				for (int _i0 = 2; (_i0 < R); _i0 = (_i0 + 1)) {
				for (int _i1 = 2; (_i1 < C); _i1 = (_i1 + 1)) {
				Syy[((_i0 * (2 + C)) + _i1)] =
				((((((((Iyy[(((-1 + _i0) * (2 + C)) + (-1 + _i1))] +
				Iyy[(((-1 + _i0) * (2 + C)) + _i1)]) +
				Iyy[(((-1 + _i0) * (2 + C)) + (1 + _i1))]) +
				Iyy[((_i0 * (2 + C)) + (-1 + _i1))]) +
				Iyy[((_i0 * (2 + C)) + _i1)]) +
				Iyy[((_i0 * (2 + C)) + (1 + _i1))]) +
				Iyy[(((1 + _i0) * (2 + C)) + (-1 + _i1))]) +
				Iyy[(((1 + _i0) * (2 + C)) + _i1)]) +
				Iyy[(((1 + _i0) * (2 + C)) + (1 + _i1))]);
				}
				}
				for (int _i0 = 2; (_i0 < R); _i0 = (_i0 + 1)) {
				for (int _i1 = 2; (_i1 < C); _i1 = (_i1 + 1)) {
				Sxy[((_i0 * (2 + C)) + _i1)] =
				((((((((Ixy[(((-1 + _i0) * (2 + C)) + (-1 + _i1))] +
				Ixy[(((-1 + _i0) * (2 + C)) + _i1)]) +
				Ixy[(((-1 + _i0) * (2 + C)) + (1 + _i1))]) +
				Ixy[((_i0 * (2 + C)) + (-1 + _i1))]) +
				Ixy[((_i0 * (2 + C)) + _i1)]) +
				Ixy[((_i0 * (2 + C)) + (1 + _i1))]) +
				Ixy[(((1 + _i0) * (2 + C)) + (-1 + _i1))]) +
				Ixy[(((1 + _i0) * (2 + C)) + _i1)]) +
				Ixy[(((1 + _i0) * (2 + C)) + (1 + _i1))]);
				}
				}
				for (int _i0 = 2; (_i0 < R); _i0 = (_i0 + 1)) {
				for (int _i1 = 2; (_i1 < C); _i1 = (_i1 + 1)) {
				Sxx[((_i0 * (2 + C)) + _i1)] =
				((((((((Ixx[(((-1 + _i0) * (2 + C)) + (-1 + _i1))] +
				Ixx[(((-1 + _i0) * (2 + C)) + _i1)]) +
				Ixx[(((-1 + _i0) * (2 + C)) + (1 + _i1))]) +
				Ixx[((_i0 * (2 + C)) + (-1 + _i1))]) +
				Ixx[((_i0 * (2 + C)) + _i1)]) +
				Ixx[((_i0 * (2 + C)) + (1 + _i1))]) +
				Ixx[(((1 + _i0) * (2 + C)) + (-1 + _i1))]) +
				Ixx[(((1 + _i0) * (2 + C)) + _i1)]) +
				Ixx[(((1 + _i0) * (2 + C)) + (1 + _i1))]);
				}
				}
				for (int _i0 = 2; (_i0 < R); _i0 = (_i0 + 1)) {
				for (int _i1 = 2; (_i1 < C); _i1 = (_i1 + 1)) {
				trace[((_i0 * (2 + C)) + _i1)] =
				(Sxx[((_i0 * (2 + C)) + _i1)] + Syy[((_i0 * (2 + C)) + _i1)]);
				}
				}
				for (int _i0 = 2; (_i0 < R); _i0 = (_i0 + 1)) {
				for (int _i1 = 2; (_i1 < C); _i1 = (_i1 + 1)) {
				det[((_i0 * (2 + C)) + _i1)] =
				((Sxx[((_i0 * (2 + C)) + _i1)] * Syy[((_i0 * (2 + C)) + _i1)]) -
				(Sxy[((_i0 * (2 + C)) + _i1)] * Sxy[((_i0 * (2 + C)) + _i1)]));
				}
				}
				for (int _i0 = 2; (_i0 < R); _i0 = (_i0 + 1)) {
				for (int _i1 = 2; (_i1 < C); _i1 = (_i1 + 1)) {
				int value = (det[((_i0 * (2 + C)) + _i1)] -
				((0.04f * trace[((_i0 * (2 + C)) + _i1)]) *
				trace[((_i0 * (2 + C)) + _i1)]));
				if (value > THRESHOLD)
				img[(_i0 * (2 + C)) + _i1] = 3;
				}
				}

				free(Ix);
				free(Iy);
				free(Ixx);
				free(Ixy);
				free(Iyy);
				free(Sxx);
				free(Sxy);
				free(Syy);
				free(det);
				free(trace);
				return;
				}

				int check_output() {
				string filename = "output.txt";
				SHA1 checksum;
				string checksum_val = checksum.from_file(filename);
				std::cout << "Computed SHA1SUM of output.txt is \"" << checksum_val << "\""
				<< std::endl;

				if (checksum_val == "9ad2cea9248750ad10d559bae830640ee1ca16d9") {
				return 1;
				}
				return 0;
				}

				int main(int argc, char *argv[]) {

				init_checkboard_image((HEIGHT + 2), (WIDTH + 2));

				#ifdef BENCHMARK_LIB
				::benchmark::Initialize(&argc, argv);
				if (::benchmark::ReportUnrecognizedArguments(argc, argv))
				return 1;
				::benchmark::RunSpecifiedBenchmarks();
				#else
				harris(image);
				write_to_file(HEIGHT + 2, WIDTH + 2, image);
				#endif

				free(image);

				int passed = check_output();
				if (!passed) {
				std::cout << "Verification Failed\n";
				exit(EXIT_FAILURE);
				} else {
				std::cout << "Verification Passed\n";
				exit(EXIT_SUCCESS);
				}
				}

MicroBenchmarks/harris/sha1.hpp

				/*
				sha1.hpp - header of

				============
				SHA-1 in C++
				============

				100% Public Domain.

				Original C Code
				-- Steve Reid <steve@edmweb.com>
				Small changes to fit into bglibs
				-- Bruce Guenter <bruce@untroubled.org>
				Translation to simpler C++ Code
				-- Volker Diels-Grabsch <v@njh.eu>
				Safety fixes
				-- Eugene Hopkinson <slowriot at voxelstorm dot com>
				*/

				#ifndef SHA1_HPP
				#define SHA1_HPP


				#include <cstdint>
				#include <iostream>
				#include <string>


				class SHA1
				{
				public:
				SHA1();
				void update(const std::string &s);
				void update(std::istream &is);
				std::string final();
				static std::string from_file(const std::string &filename);

				private:
				uint32_t digest[5];
				std::string buffer;
				uint64_t transforms;
				};


				#endif /* SHA1_HPP */
Meinersbur: There is already hashing used by test-suite (see `HashProgramOutput.sh`), why add another one?

MicroBenchmarks/harris/sha1.cpp

				/*
				sha1.cpp - source code of

				============
				SHA-1 in C++
				============

				100% Public Domain.

				Original C Code
				-- Steve Reid <steve@edmweb.com>
				Small changes to fit into bglibs
				-- Bruce Guenter <bruce@untroubled.org>
				Translation to simpler C++ Code
				-- Volker Diels-Grabsch <v@njh.eu>
				Safety fixes
				-- Eugene Hopkinson <slowriot at voxelstorm dot com>
				*/

				#include "sha1.hpp"
				#include <sstream>
				#include <iomanip>
				#include <fstream>


				static const size_t BLOCK_INTS = 16; /* number of 32bit integers per SHA1 block */
				static const size_t BLOCK_BYTES = BLOCK_INTS * 4;


				static void reset(uint32_t digest[], std::string &buffer, uint64_t &transforms)
				{
				/* SHA1 initialization constants */
				digest[0] = 0x67452301;
				digest[1] = 0xefcdab89;
				digest[2] = 0x98badcfe;
				digest[3] = 0x10325476;
				digest[4] = 0xc3d2e1f0;

				/* Reset counters */
				buffer = "";
				transforms = 0;
				}


				static uint32_t rol(const uint32_t value, const size_t bits)
				{
				return (value << bits) | (value >> (32 - bits));
				}


				static uint32_t blk(const uint32_t block[BLOCK_INTS], const size_t i)
				{
				return rol(block[(i+13)&15] ^ block[(i+8)&15] ^ block[(i+2)&15] ^ block[i], 1);
				}


				/*
				* (R0+R1), R2, R3, R4 are the different operations used in SHA1
				*/

				static void R0(const uint32_t block[BLOCK_INTS], const uint32_t v, uint32_t &w, const uint32_t x, const uint32_t y, uint32_t &z, const size_t i)
				{
				z += ((w&(x^y))^y) + block[i] + 0x5a827999 + rol(v, 5);
				w = rol(w, 30);
				}


				static void R1(uint32_t block[BLOCK_INTS], const uint32_t v, uint32_t &w, const uint32_t x, const uint32_t y, uint32_t &z, const size_t i)
				{
				block[i] = blk(block, i);
				z += ((w&(x^y))^y) + block[i] + 0x5a827999 + rol(v, 5);
				w = rol(w, 30);
				}


				static void R2(uint32_t block[BLOCK_INTS], const uint32_t v, uint32_t &w, const uint32_t x, const uint32_t y, uint32_t &z, const size_t i)
				{
				block[i] = blk(block, i);
				z += (w^x^y) + block[i] + 0x6ed9eba1 + rol(v, 5);
				w = rol(w, 30);
				}


				static void R3(uint32_t block[BLOCK_INTS], const uint32_t v, uint32_t &w, const uint32_t x, const uint32_t y, uint32_t &z, const size_t i)
				{
				block[i] = blk(block, i);
				z += (((w|x)&y)|(w&x)) + block[i] + 0x8f1bbcdc + rol(v, 5);
				w = rol(w, 30);
				}


				static void R4(uint32_t block[BLOCK_INTS], const uint32_t v, uint32_t &w, const uint32_t x, const uint32_t y, uint32_t &z, const size_t i)
				{
				block[i] = blk(block, i);
				z += (w^x^y) + block[i] + 0xca62c1d6 + rol(v, 5);
				w = rol(w, 30);
				}


				/*
				* Hash a single 512-bit block. This is the core of the algorithm.
				*/

				static void transform(uint32_t digest[], uint32_t block[BLOCK_INTS], uint64_t &transforms)
				{
				/* Copy digest[] to working vars */
				uint32_t a = digest[0];
				uint32_t b = digest[1];
				uint32_t c = digest[2];
				uint32_t d = digest[3];
				uint32_t e = digest[4];

				/* 4 rounds of 20 operations each. Loop unrolled. */
				R0(block, a, b, c, d, e, 0);
				R0(block, e, a, b, c, d, 1);
				R0(block, d, e, a, b, c, 2);
				R0(block, c, d, e, a, b, 3);
				R0(block, b, c, d, e, a, 4);
				R0(block, a, b, c, d, e, 5);
				R0(block, e, a, b, c, d, 6);
				R0(block, d, e, a, b, c, 7);
				R0(block, c, d, e, a, b, 8);
				R0(block, b, c, d, e, a, 9);
				R0(block, a, b, c, d, e, 10);
				R0(block, e, a, b, c, d, 11);
				R0(block, d, e, a, b, c, 12);
				R0(block, c, d, e, a, b, 13);
				R0(block, b, c, d, e, a, 14);
				R0(block, a, b, c, d, e, 15);
				R1(block, e, a, b, c, d, 0);
				R1(block, d, e, a, b, c, 1);
				R1(block, c, d, e, a, b, 2);
				R1(block, b, c, d, e, a, 3);
				R2(block, a, b, c, d, e, 4);
				R2(block, e, a, b, c, d, 5);
				R2(block, d, e, a, b, c, 6);
				R2(block, c, d, e, a, b, 7);
				R2(block, b, c, d, e, a, 8);
				R2(block, a, b, c, d, e, 9);
				R2(block, e, a, b, c, d, 10);
				R2(block, d, e, a, b, c, 11);
				R2(block, c, d, e, a, b, 12);
				R2(block, b, c, d, e, a, 13);
				R2(block, a, b, c, d, e, 14);
				R2(block, e, a, b, c, d, 15);
				R2(block, d, e, a, b, c, 0);
				R2(block, c, d, e, a, b, 1);
				R2(block, b, c, d, e, a, 2);
				R2(block, a, b, c, d, e, 3);
				R2(block, e, a, b, c, d, 4);
				R2(block, d, e, a, b, c, 5);
				R2(block, c, d, e, a, b, 6);
				R2(block, b, c, d, e, a, 7);
				R3(block, a, b, c, d, e, 8);
				R3(block, e, a, b, c, d, 9);
				R3(block, d, e, a, b, c, 10);
				R3(block, c, d, e, a, b, 11);
				R3(block, b, c, d, e, a, 12);
				R3(block, a, b, c, d, e, 13);
				R3(block, e, a, b, c, d, 14);
				R3(block, d, e, a, b, c, 15);
				R3(block, c, d, e, a, b, 0);
				R3(block, b, c, d, e, a, 1);
				R3(block, a, b, c, d, e, 2);
				R3(block, e, a, b, c, d, 3);
				R3(block, d, e, a, b, c, 4);
				R3(block, c, d, e, a, b, 5);
				R3(block, b, c, d, e, a, 6);
				R3(block, a, b, c, d, e, 7);
				R3(block, e, a, b, c, d, 8);
				R3(block, d, e, a, b, c, 9);
				R3(block, c, d, e, a, b, 10);
				R3(block, b, c, d, e, a, 11);
				R4(block, a, b, c, d, e, 12);
				R4(block, e, a, b, c, d, 13);
				R4(block, d, e, a, b, c, 14);
				R4(block, c, d, e, a, b, 15);
				R4(block, b, c, d, e, a, 0);
				R4(block, a, b, c, d, e, 1);
				R4(block, e, a, b, c, d, 2);
				R4(block, d, e, a, b, c, 3);
				R4(block, c, d, e, a, b, 4);
				R4(block, b, c, d, e, a, 5);
				R4(block, a, b, c, d, e, 6);
				R4(block, e, a, b, c, d, 7);
				R4(block, d, e, a, b, c, 8);
				R4(block, c, d, e, a, b, 9);
				R4(block, b, c, d, e, a, 10);
				R4(block, a, b, c, d, e, 11);
				R4(block, e, a, b, c, d, 12);
				R4(block, d, e, a, b, c, 13);
				R4(block, c, d, e, a, b, 14);
				R4(block, b, c, d, e, a, 15);

				/* Add the working vars back into digest[] */
				digest[0] += a;
				digest[1] += b;
				digest[2] += c;
				digest[3] += d;
				digest[4] += e;

				/* Count the number of transformations */
				transforms++;
				}


				static void buffer_to_block(const std::string &buffer, uint32_t block[BLOCK_INTS])
				{
				/* Convert the std::string (byte buffer) to a uint32_t array (MSB) */
				for (size_t i = 0; i < BLOCK_INTS; i++)
				{
				block[i] = (buffer[4*i+3] & 0xff)
				| (buffer[4*i+2] & 0xff)<<8
				| (buffer[4*i+1] & 0xff)<<16
				| (buffer[4*i+0] & 0xff)<<24;
				}
				}


				SHA1::SHA1()
				{
				reset(digest, buffer, transforms);
				}


				void SHA1::update(const std::string &s)
				{
				std::istringstream is(s);
				update(is);
				}


				void SHA1::update(std::istream &is)
				{
				while (true)
				{
				char sbuf[BLOCK_BYTES];
				is.read(sbuf, BLOCK_BYTES - buffer.size());
				buffer.append(sbuf, is.gcount());
				if (buffer.size() != BLOCK_BYTES)
				{
				return;
				}
				uint32_t block[BLOCK_INTS];
				buffer_to_block(buffer, block);
				transform(digest, block, transforms);
				buffer.clear();
				}
				}


				/*
				* Add padding and return the message digest.
				*/

				std::string SHA1::final()
				{
				/* Total number of hashed bits */
				uint64_t total_bits = (transforms*BLOCK_BYTES + buffer.size()) * 8;

				/* Padding */
				buffer += 0x80;
				size_t orig_size = buffer.size();
				while (buffer.size() < BLOCK_BYTES)
				{
				buffer += (char)0x00;
				}

				uint32_t block[BLOCK_INTS];
				buffer_to_block(buffer, block);

				if (orig_size > BLOCK_BYTES - 8)
				{
				transform(digest, block, transforms);
				for (size_t i = 0; i < BLOCK_INTS - 2; i++)
				{
				block[i] = 0;
				}
				}

				/* Append total_bits, split this uint64_t into two uint32_t */
				block[BLOCK_INTS - 1] = total_bits;
				block[BLOCK_INTS - 2] = (total_bits >> 32);
				transform(digest, block, transforms);

				/* Hex std::string */
				std::ostringstream result;
				for (size_t i = 0; i < sizeof(digest) / sizeof(digest[0]); i++)
				{
				result << std::hex << std::setfill('0') << std::setw(8);
				result << digest[i];
				}

				/* Reset for next run */
				reset(digest, buffer, transforms);

				return result.str();
				}


				std::string SHA1::from_file(const std::string &filename)
				{
				std::ifstream stream(filename.c_str(), std::ios::binary);
				SHA1 checksum;
				checksum.update(stream);
				return checksum.final();
				}
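
The padding logic in `SHA1::final()` can be checked in isolation. The sketch below is not part of the patch: `total_bits` and `final_blocks` are hypothetical helpers that restate the arithmetic from the listing, assuming the buffer always holds fewer than `BLOCK_BYTES` bytes when `final()` is called (which `update()` guarantees by clearing every full block).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const std::size_t BLOCK_BYTES = 64; // 16 32-bit words, as in sha1.cpp

// Hypothetical helper: message length in bits, given the number of full
// blocks already transformed and the bytes still waiting in the buffer.
std::uint64_t total_bits(std::uint64_t transforms, std::size_t buffered) {
  return (transforms * BLOCK_BYTES + buffered) * 8;
}

// Hypothetical helper: number of blocks final() transforms for `buffered`
// leftover bytes. A single 0x80 marker byte is appended first; if the 8-byte
// length field no longer fits in the same block, a second block is needed.
std::size_t final_blocks(std::size_t buffered) {
  assert(buffered < BLOCK_BYTES);
  const std::size_t with_marker = buffered + 1;
  return with_marker > BLOCK_BYTES - 8 ? 2 : 1;
}
```

For example, 55 leftover bytes still fit in one block together with the marker and the length field, while 56 leftover bytes force a second block.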