This is an archive of the discontinued LLVM Phabricator instance.

[Test-Suite] Added Box Blur And Sobel Edge Detection
Needs Review · Public

Authored by proton on May 10 2018, 5:06 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

I will be adding some more algorithms/benchmarks to the test suite throughout this summer (as this is my GSoC project). I want to know whether it would be better to add the benchmarks in a single new directory (say [MultiSource||External||SingleSource]/Benchmarks/NewBenchmarks) or to add each new algorithm in its own directory?

Here, I have created a separate directory for each algorithm (in SingleSource/Benchmarks/ImageProcessing/[blur||sobel]).

Reference: "https://github.com/haldos/edges/tree/master/src"

Diff Detail

Event Timeline

proton created this revision. May 10 2018, 5:06 PM
proton created this object with edit policy "Administrators".
MatzeB added a subscriber: MatzeB. May 10 2018, 5:23 PM

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.
  • I'd go for "MultiSource" benchmarks, even if they end up being a single source file; producing multiple executables out of a single directory of source files is more complicated than it seems.
  • Did you check the amount of time spent on initializing and printing the image compared to the main operation of the benchmark?
MultiSource/Benchmarks/7zip/CMakeLists.txt
4

Unrelated, please keep it separate.

SingleSource/Benchmarks/ImageProcessing/blur/blur.cpp
27

If you develop the benchmarks from scratch, it would be nice if you could design them so that they take a single number as an argument, where increasing the number makes the benchmark run longer; it's fine to print different results for different numbers.

This makes it possible to choose a good size for different devices in the future.
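For illustration only, a minimal sketch of such an interface (not code from the patch; the names, the 128-per-step scaling, and the summation stand-in for the real kernel are made up):

  // Hypothetical sketch: one integer argument scales the problem size;
  // printing a result that depends on the size is fine for reference outputs.
  #include <cstdio>
  #include <cstdlib>
  #include <vector>

  int main(int argc, char **argv) {
    int Scale = argc > 1 ? std::atoi(argv[1]) : 8; // default problem size
    int N = 128 * Scale;                           // image edge length grows with Scale

    std::vector<int> Image(N * N);
    for (int i = 0; i < N * N; ++i)
      Image[i] = i % 256; // deterministic initialization, no file I/O

    long long Sum = 0;    // stand-in for the real blur/sobel kernel
    for (int i = 0; i < N * N; ++i)
      Sum += Image[i];

    std::printf("%lld\n", Sum); // output may differ per Scale, which is acceptable
    return 0;
  }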

SingleSource/Benchmarks/ImageProcessing/sobel/Makefile
6

In my experience benchmarks using HASH_PROGRAM_OUTPUT are very noisy (at least on the systems I care about), because we inadvertently end up testing how fast the kernel can pipe output from one process into the next (which is the md5 utility).
Rather, go for a simple checksumming mechanism as part of the source code.
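As a hedged illustration (not code from the patch), such an in-source checksum could look roughly like this, using FNV-1a-style constants purely as an example:

  // Hypothetical sketch: fold the result image into a single value and print
  // only that, so no external md5 pipe is needed to verify the output.
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  static uint32_t checksum(const std::vector<int> &Img) {
    uint32_t Hash = 2166136261u; // FNV-1a offset basis
    for (int Pixel : Img) {
      Hash ^= static_cast<uint32_t>(Pixel);
      Hash *= 16777619u;         // FNV-1a prime
    }
    return Hash;
  }

  int main() {
    std::vector<int> Output(1024 * 1024, 1); // placeholder for the kernel result
    std::printf("checksum: %u\n", checksum(Output));
    return 0;
  }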

proton edited the summary of this revision. May 11 2018, 2:35 AM
proton edited the summary of this revision.
proton added a subscriber: pollydev.

Some context: Pankaj is a GSoC student on a project to add more Polly-optimizable benchmarks. Currently, only SingleSource/Benchmarks/Polybench is really optimizable; most other benchmarks contain some kind of pre-optimization that makes it difficult for Polly to preserve semantics, even if the algorithm itself is 100% optimizable.

This is the first patch of hopefully many others. Comments on its outer form, structure, etc. are highly appreciated before more are added. Thanks @MatzeB.

Sorry, I did not read the summary. I expected it to just contain a description of blur and sobel.

Some context: Pankaj is a GSoC student on a project to add more Polly-optimizable benchmarks. Currently, only SingleSource/Benchmarks/Polybench is really optimizable; most other benchmarks contain some kind of pre-optimization that makes it difficult for Polly to preserve semantics, even if the algorithm itself is 100% optimizable.

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.
  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.
  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

Finally, regarding the actual code, I would run clang-format and remove these blocks of empty lines.

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

  • I'd go for "MultiSource" benchmarks, even if they end up being a single source file; Producing multiple executables out of a single directory of source files is more complicated than it seems.
  • Did you check the amount of time spent on initializing and printing the image compared to the main operation of the benchmark?

No. Do we also need to submit the time values when we add a new benchmark/kernel?
About 8% of the total time the program takes was spent in the sobel_edge_detection function.

Some context: Pankaj is a GSoC student on a project to add more Polly-optimizable benchmarks. Currently, only SingleSource/Benchmarks/Polybench is really optimizable; most other benchmarks contain some kind of pre-optimization that makes it difficult for Polly to preserve semantics, even if the algorithm itself is 100% optimizable.

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.

This blur is similar to the blur in Halide/blur. I didn't know that there was already a blur in the test-suite. I will be more careful next time.

  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.

I chose the window size arbitrarily. I wanted more computation so that the program runs for more than 2 seconds, since a small runtime has a larger fraction of noise in the time values (due to the scheduler/CPU physical condition) than a larger one. A smaller window (3 is generally used) will also work.

  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

I wanted a matrix to store the image, so I used a 2D array here.

Finally, regarding the actual code, I would run clang-format and remove these blocks of empty lines.

I will make sure to run clang-format from now on before uploading any code here.

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application running-time differences at around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part so we can run it more often). It's often better to collect statistics over multiple runs than over a single longer run, regardless.

Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.
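For reference, a kernel timed with the Google benchmark library is registered roughly as sketched below; the blurKernel body and the sizes are made-up placeholders, not code from MicroBenchmarks:

  #include <benchmark/benchmark.h>
  #include <vector>

  // Made-up 3x3 box blur stand-in for the kernel under test.
  static void blurKernel(std::vector<int> &Dst, const std::vector<int> &Src, int N) {
    for (int y = 1; y + 1 < N; ++y)
      for (int x = 1; x + 1 < N; ++x) {
        int Sum = 0;
        for (int dy = -1; dy <= 1; ++dy)
          for (int dx = -1; dx <= 1; ++dx)
            Sum += Src[(y + dy) * N + (x + dx)];
        Dst[y * N + x] = Sum / 9;
      }
  }

  static void BM_Blur(benchmark::State &State) {
    const int N = static_cast<int>(State.range(0));
    std::vector<int> Src(N * N, 1), Dst(N * N, 0);
    for (auto _ : State) {
      blurKernel(Dst, Src, N);
      benchmark::DoNotOptimize(Dst.data()); // keep the result live
    }
  }
  BENCHMARK(BM_Blur)->Arg(256)->Arg(2048); // one cache-resident size, one larger

  BENCHMARK_MAIN();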

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

I agree that studying existing sources, why they are not optimized, and improving the optimizer so that the reason is no longer an obstacle, is the primary goal.
The problem is that this is not feasible for all sources. For instance, the array-of-pointers-to-arrays style (which @proton unfortunately also used here; I call them "jagged arrays", although they are not necessarily jagged) cannot be optimized because the pointers may overlap. Either the frontend language has to ensure that this never happens, or each pointer has to be compared pairwise for aliasing. The first is not the case in C++ (without extensions such as the restrict keyword), and the latter involves a super-constant overhead. Unfortunately, very many benchmarks use jagged arrays.
Second, even if it is possible to remove an optimization obstacle, I would like to know whether it is worth it.
Third, researchers in the field of polyhedral optimization work on improving the optimizer algorithm and ignore language-level details (e.g., whether jagged, row-major, or column-major arrays are used).

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.

The Halide benchmarks are special in many regards; for instance, they work only on x86.

  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.

At some point a problem size has to be arbitrarily defined. What kind of explanation do you expect?

  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

+1

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application running-time differences at around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part so we can run it more often). It's often better to collect statistics over multiple runs than over a single longer run, regardless.

Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.

IMHO we also want to see effects on working-set sizes larger than the last-level cache. A micro-benchmark is great for small working sets, but I am not sure whether Google's benchmark library works well with longer-running ones.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

It's great to have a discussion on what such benchmarks should look like.

Instead of one-size-fits-all, should we have multiple problem sizes? There is already SMALL_DATASET, which is smaller than the default, but what about larger ones? SPEC has "test" (should execute everything at least once; great for checking correctness), "train" (for PGO training), and "ref" (the scored benchmark input; in CPU 2017 it runs up to 2 hours). Polybench has MINI_DATASET to EXTRALARGE_DATASET, which are defined by working-set size instead of purpose or runtime.

Should we embed the kernels in a framework such as Google's, provided that it handles long runtimes and verifies correctness of the result?

SingleSource/Benchmarks/ImageProcessing/blur/blur.cpp
62

Please use C-style arrays (int[WIDTH][HEIGHT]) or C99 variable-length arrays (VLAs) instead of an array of pointers ("jagged array").

It is difficult for Polly, or any other optimizer, to prove that none of these pointers alias each other.
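A small sketch of the two styles (illustrative only; WIDTH and HEIGHT are assumed to be compile-time constants here):

  constexpr int WIDTH = 1024, HEIGHT = 1024;

  int main() {
    // Jagged array: each row is a separately allocated pointer; the optimizer
    // has to assume that any two rows might alias.
    int **Jagged = new int *[HEIGHT];
    for (int y = 0; y < HEIGHT; ++y)
      Jagged[y] = new int[WIDTH]();

    // Plain 2-D array: one contiguous block with a statically known layout,
    // so distinct rows are provably disjoint.
    static int Image[HEIGHT][WIDTH];
    Image[1][2] = Jagged[1][2];

    for (int y = 0; y < HEIGHT; ++y)
      delete[] Jagged[y];
    delete[] Jagged;
    return Image[1][2];
  }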

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

It's great to have a discussion on what such benchmarks should look like.

Instead of one-size-fits-all, should we have multiple problem sizes? There is already SMALL_DATASET, which is smaller than the default, but what about larger ones? SPEC has "test" (should execute everything at least once; great for checking correctness), "train" (for PGO training), and "ref" (the scored benchmark input; in CPU 2017 it runs up to 2 hours). Polybench has MINI_DATASET to EXTRALARGE_DATASET, which are defined by working-set size instead of purpose or runtime.

  • First: We have to choose a default problem size and I just wanted to emphasize that it should not be too big, so running the llvm test-suite finishes in a reasonable timeframe.
  • As I mentioned in another part of my review: I'd recommend writing the benchmarks in a way that they take a single number as a command line argument and then scale the problem size based on that number. We don't really have infrastructure for that today (I consider SMALL_DATASET a super crude tool and would rather not extend it with more variants...), but this style seems easy enough to implement to me and should allow scaling the input size up and down in case someone comes around writing better infrastructure.

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

I agree that studying existing sources, why they are not optimized, and improving the optimizer so that the reason is no longer an obstacle, is the primary goal.
The problem is that this is not feasible for all sources. For instance, the array-of-pointers-to-arrays style (which @proton unfortunately also used here; I call them "jagged arrays", although they are not necessarily jagged) cannot be optimized because the pointers may overlap. Either the frontend language has to ensure that this never happens, or each pointer has to be compared pairwise for aliasing. The first is not the case in C++ (without extensions such as the restrict keyword), and the latter involves a super-constant overhead. Unfortunately, very many benchmarks use jagged arrays.
Second, even if it is possible to remove an optimization obstacle, I would like to know whether it is worth it.
Third, researchers in the field of polyhedral optimization work on improving the optimizer algorithm and ignore language-level details (e.g., whether jagged, row-major, or column-major arrays are used).

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.

The Halide benchmarks are special in many regards; for instance, they work only on x86.

  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.

At some point a problem size has to be arbitrarily defined. What kind of explanation do you expect?

  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

+1

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application running-time differences at around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part so we can run it more often). It's often better to collect statistics over multiple runs than over a single longer run, regardless.

Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.

IMHO we also want to see effects on working-set sizes larger than the last-level cache. A micro-benchmark is great for small working sets, but I am not sure whether Google's benchmark library works well with longer-running ones.

I don't see why it wouldn't work on longer-running kernels. Nevertheless, modern machines have bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).

Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts doing that, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then, until then, the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'd need to change the test. One of the reasons that I like the Google benchmark library is that it dynamically adjusts the number of iterations, thus essentially changing this for us as needed.

It's great to have a discussion on what such benchmarks should look like.

Instead of one-size-fits-all, should we have multiple problem sizes? There is already SMALL_DATASET, which is smaller than the default, but what about larger ones? SPEC has "test" (should execute everything at least once; great for checking correctness), "train" (for PGO training), and "ref" (the scored benchmark input; in CPU 2017 it runs up to 2 hours). Polybench has MINI_DATASET to EXTRALARGE_DATASET, which are defined by working-set size instead of purpose or runtime.

We already have a SMALL_PROBLEM_SIZE setting. I don't think there's anything preventing us from adding other ones, although it's not clear to me how often they'd be used.

Should we embed the kernels in a framework such as Google's, provided that it handles long runtimes and verifies correctness of the result?

I don't recall if it has a way to validate correctness.

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

I agree that studying existing sources, why they are not optimized, and improving the optimizer so that the reason is no longer an obstacle, is the primary goal.
The problem is that this is not feasible for all sources. For instance, the array-of-pointers-to-arrays style (which @proton unfortunately also used here; I call them "jagged arrays", although they are not necessarily jagged) cannot be optimized because the pointers may overlap. Either the frontend language has to ensure that this never happens, or each pointer has to be compared pairwise for aliasing. The first is not the case in C++ (without extensions such as the restrict keyword), and the latter involves a super-constant overhead. Unfortunately, very many benchmarks use jagged arrays.
Second, even if it is possible to remove an optimization obstacle, I would like to know whether it is worth it.
Third, researchers in the field of polyhedral optimization work on improving the optimizer algorithm and ignore language-level details (e.g., whether jagged, row-major, or column-major arrays are used).

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.

The Halide benchmarks are special in many regards; for instance, they work only on x86.

  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.

At some point a problem size has to be arbitrarily defined. What kind of explanation do you expect?

  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

+1

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application running-time differences at around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part so we can run it more often). It's often better to collect statistics over multiple runs than over a single longer run, regardless.

Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.

IMHO we also want to see effects on working-set sizes larger than the last-level cache. A micro-benchmark is great for small working sets, but I am not sure whether Google's benchmark library works well with longer-running ones.

I don't see why it wouldn't work on longer-running kernels. Nevertheless, modern machines have bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).

Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts doing that, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then, until then, the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'd need to change the test. One of the reasons that I like the Google benchmark library is that it dynamically adjusts the number of iterations, thus essentially changing this for us as needed.

True, Google benchmark solves the timing problems nicely.

BTW: Does someone have experience with how stable/well it works to evaluate code size for microbenchmarks, or with pointing profiling tools at them?

I don't see why it wouldn't work on longer-running kernels.

It might run the kernel just once. That is, we only get results from a cold cache.

Nevertheless, modern machines have bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.

High-complexity algorithms such as naive matrix determinant may require more time for problems larger than the last-level cache.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).

Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts doing that, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then, until then, the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'd need to change the test. One of the reasons that I like the Google benchmark library is that it dynamically adjusts the number of iterations, thus essentially changing this for us as needed.

If someone enables auto-parallelization, they probably should leave (at least some) cores available.
For continuous integration, correctness is much more important, so such bots would run with a safe problem size (in the sense that a missed optimization still executes in reasonable time). For dedicated benchmarking, we should select a larger problem size.
That is, the default configuration can be "safe" while having larger problem sizes (including just running more often, like Google benchmark does) for different situations.

I don't see why it wouldn't work on longer-running kernels.

It might run the kernel just once. That is, we only get results from a cold cache.

I don't believe that it will run just once. There's a minimum number of iterations (in part, as I understand it, because it needs to get an estimate of the variance).

Nevertheless, modern machines have bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.

High-complexity algorithms such as naive matrix determinant may require more time for problems larger than the last-level cache.

Granted.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).

Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts doing that, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then, until then, the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'd need to change the test. One of the reasons that I like the Google benchmark library is that it dynamically adjusts the number of iterations, thus essentially changing this for us as needed.

If someone enables auto-parallelization, they probably should leave (at least some) cores available.

This doesn't just come up in that context. There are plenty of codes which use OpenMP or some threading-enabled library.

For continuous integration, correctness is much more important, so such bots would run with a safe problem size (in the sense that a missed optimization still executes in reasonable time). For dedicated benchmarking, we should select a larger problem size.
That is, the default configuration can be "safe" while having larger problem sizes (including just running more often, like Google benchmark does) for different situations.

I'm not talking about CI for correctness, although we should obviously do that too, but about doing regular performance monitoring.