This is an archive of the discontinued LLVM Phabricator instance.

[Test-Suite] Added Box Blur And Sobel Edge Detection
Needs Review · Public

Authored by proton on May 10 2018, 5:06 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

I will be adding some more algorithms/benchmarks to the test suite throughout this summer (as this is my GSoC project). I want to know whether it would be better to add the benchmarks in a single new directory (say [MultiSource||External||SingleSource]/Benchmarks/NewBenchmarks) or to add each new algorithm in its own directory?

Here, I have created a separate directory for each algorithm (in SingleSource/Benchmarks/ImageProcessing/[blur||sobel]).

Reference: "https://github.com/haldos/edges/tree/master/src"

Diff Detail

Event Timeline

proton created this revision. May 10 2018, 5:06 PM
proton created this object with edit policy "Administrators".
MatzeB added a subscriber: MatzeB. May 10 2018, 5:23 PM

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.
  • I'd go for "MultiSource" benchmarks, even if they end up being a single source file; producing multiple executables out of a single directory of source files is more complicated than it seems.
  • Did you check the amount of time spent on initializing and printing the image compared to the main operation of the benchmark?
MultiSource/Benchmarks/7zip/CMakeLists.txt
4

Unrelated, please keep it separate.

SingleSource/Benchmarks/ImageProcessing/blur/blur.cpp
27

If you develop the benchmarks from scratch, it would be nice if you could design them so that they take a single number as an argument, where increasing the number makes the benchmark run longer; it's fine to print different results for different numbers.

This makes it possible to choose a good size for different devices in the future.
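For illustration only, a minimal sketch of such an interface (not code from the patch; the names, the 128-per-step scaling, and the summation stand-in for the real kernel are made up):

  // Hypothetical sketch: one integer argument scales the problem size;
  // printing a result that depends on the size is fine for reference outputs.
  #include <cstdio>
  #include <cstdlib>
  #include <vector>

  int main(int argc, char **argv) {
    int Scale = argc > 1 ? std::atoi(argv[1]) : 8; // default problem size
    int N = 128 * Scale;                           // image edge length grows with Scale

    std::vector<int> Image(N * N);
    for (int i = 0; i < N * N; ++i)
      Image[i] = i % 256; // deterministic initialization, no file I/O

    long long Sum = 0;    // stand-in for the real blur/sobel kernel
    for (int i = 0; i < N * N; ++i)
      Sum += Image[i];

    std::printf("%lld\n", Sum); // output may differ per Scale, which is acceptable
    return 0;
  }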

SingleSource/Benchmarks/ImageProcessing/sobel/Makefile
6

In my experience benchmarks using HASH_PROGRAM_OUTPUT are very noisy (at least on the systems I care about), because we inadvertently end up testing how fast the kernel can pipe output from one process into the next (which is the md5 utility).
Rather, go for a simple checksumming mechanism as part of the source code.
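As a hedged illustration (not code from the patch), such an in-source checksum could look roughly like this, using FNV-1a-style constants purely as an example:

  // Hypothetical sketch: fold the result image into a single value and print
  // only that, so no external md5 pipe is needed to verify the output.
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  static uint32_t checksum(const std::vector<int> &Img) {
    uint32_t Hash = 2166136261u; // FNV-1a offset basis
    for (int Pixel : Img) {
      Hash ^= static_cast<uint32_t>(Pixel);
      Hash *= 16777619u;         // FNV-1a prime
    }
    return Hash;
  }

  int main() {
    std::vector<int> Output(1024 * 1024, 1); // placeholder for the kernel result
    std::printf("checksum: %u\n", checksum(Output));
    return 0;
  }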

proton edited the summary of this revision. May 11 2018, 2:35 AM
proton edited the summary of this revision.
proton added a subscriber: pollydev.

Some context: Pankaj is a GSoC student on a project to add more Polly-optimizable benchmarks. Currently, only SingleSource/Benchmarks/Polybench is really optimizable; most other benchmarks contain some kind of pre-optimization that makes it difficult for Polly to preserve semantics, even if the algorithm itself is 100% optimizable.

This is the first patch of hopefully many others. Comments on its outer form, structure, etc. are highly appreciated before more are added. Thanks @MatzeB.

Sorry, I did not read the summary. I expected it to just contain a description of blur and sobel.

Some context: Pankaj is a GSoC student on a project to add more Polly-optimizable benchmarks. Currently, only SingleSource/Benchmarks/Polybench is really optimizable; most other benchmarks contain some kind of pre-optimization that makes it difficult for Polly to preserve semantics, even if the algorithm itself is 100% optimizable.

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.
  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.
  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

Finally, regarding the actual code, I would run clang-format and remove these blocks of empty lines.

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

  • I'd go for "MultiSource" benchmarks, even if they end up being a single source file; Producing multiple executables out of a single directory of source files is more complicated than it seems.
  • Did you check the amount of time spent on initializing and printing the image compared to the main operation of the benchmark?

No. Do we also need to submit the time values when we add a new benchmark/kernel?
About 8% of the total time the program takes was spent in the sobel_edge_detection function.

Some context: Pankaj is a GSoC student on a project to add more Polly-optimizable benchmarks. Currently, only SingleSource/Benchmarks/Polybench is really optimizable; most other benchmarks contain some kind of pre-optimization that makes it difficult for Polly to preserve semantics, even if the algorithm itself is 100% optimizable.

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.

This blur is similar to the blur in Halide/blur. I didn't know that there was already a blur in the test-suite. I will be more careful next time.

  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.

I chose the window size arbitrarily. I wanted more computation so that the program runs for more than 2 seconds, since a small runtime has a larger fraction of noise in the time values (due to the scheduler/CPU physical condition) than a larger one. A smaller window (3 is generally used) will also work.

  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

I wanted a matrix to store the image, so I used a 2D array here.

Finally, regarding the actual code, I would run clang-format and remove these blocks of empty lines.

I will make sure to run clang-format from now on before uploading any code here.

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application running-time differences at around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part so we can run it more often). It's often better to collect statistics over multiple runs than over a single longer run, regardless.

Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.
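For reference, a kernel timed with the Google benchmark library is registered roughly as sketched below; the blurKernel body and the sizes are made-up placeholders, not code from MicroBenchmarks:

  #include <benchmark/benchmark.h>
  #include <vector>

  // Made-up 3x3 box blur stand-in for the kernel under test.
  static void blurKernel(std::vector<int> &Dst, const std::vector<int> &Src, int N) {
    for (int y = 1; y + 1 < N; ++y)
      for (int x = 1; x + 1 < N; ++x) {
        int Sum = 0;
        for (int dy = -1; dy <= 1; ++dy)
          for (int dx = -1; dx <= 1; ++dx)
            Sum += Src[(y + dy) * N + (x + dx)];
        Dst[y * N + x] = Sum / 9;
      }
  }

  static void BM_Blur(benchmark::State &State) {
    const int N = static_cast<int>(State.range(0));
    std::vector<int> Src(N * N, 1), Dst(N * N, 0);
    for (auto _ : State) {
      blurKernel(Dst, Src, N);
      benchmark::DoNotOptimize(Dst.data()); // keep the result live
    }
  }
  BENCHMARK(BM_Blur)->Arg(256)->Arg(2048); // one cache-resident size, one larger

  BENCHMARK_MAIN();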

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

I agree that studying existing sources, why they are not optimized, and improving the optimizer so that the reason is no longer an obstacle, is the primary goal.
The problem is that this is not feasible for all sources. For instance, the array-of-pointers-to-arrays style (which @proton unfortunately also used here; I call them "jagged arrays", although they are not necessarily jagged) cannot be optimized because the pointers may overlap. Either the frontend language has to ensure that this never happens, or each pointer has to be compared pairwise for aliasing. The first is not the case in C++ (without extensions such as the restrict keyword), and the latter involves a super-constant overhead. Unfortunately, very many benchmarks use jagged arrays.
Second, even if it is possible to remove an optimization obstacle, I would like to know whether it is worth it.
Third, researchers in the field of polyhedral optimization work on improving the optimizer algorithm and ignore language-level details (e.g., whether jagged, row-major, or column-major arrays are used).

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.

The Halide benchmarks are special in many regards; for instance, they work only on x86.

  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.

At some point a problem size has to be arbitrarily defined. What kind of explanation do you expect?

  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

+1

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application running-time differences at around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part so we can run it more often). It's often better to collect statistics over multiple runs than over a single longer run, regardless.

Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.

IMHO we also want to see effects on working-set sizes larger than the last-level cache. A micro-benchmark is great for small working sets, but I am not sure whether Google's benchmark library works well with longer-running ones.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

It's great to have a discussion on what such benchmarks should look like.

Instead of one-size-fits-all, should we have multiple problem sizes? There is already SMALL_DATASET, which is smaller than the default, but what about larger ones? SPEC has "test" (should execute everything at least once; great for checking correctness), "train" (for PGO training), and "ref" (the scored benchmark input; in CPU 2017 it runs up to 2 hours). Polybench has MINI_DATASET to EXTRALARGE_DATASET, which are defined by working-set size instead of purpose or runtime.

Should we embed the kernels in a framework such as Google's, provided that it handles long runtimes and verifies correctness of the result?

SingleSource/Benchmarks/ImageProcessing/blur/blur.cpp
62

Please use C-style arrays (int[WIDTH][HEIGHT]) or C99 variable-length arrays (VLAs) instead of an array of pointers ("jagged array").

It is difficult for Polly, or any other optimizer, to prove that none of these pointers alias each other.
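A small sketch of the two styles (illustrative only; WIDTH and HEIGHT are assumed to be compile-time constants here):

  constexpr int WIDTH = 1024, HEIGHT = 1024;

  int main() {
    // Jagged array: each row is a separately allocated pointer; the optimizer
    // has to assume that any two rows might alias.
    int **Jagged = new int *[HEIGHT];
    for (int y = 0; y < HEIGHT; ++y)
      Jagged[y] = new int[WIDTH]();

    // Plain 2-D array: one contiguous block with a statically known layout,
    // so distinct rows are provably disjoint.
    static int Image[HEIGHT][WIDTH];
    Image[1][2] = Jagged[1][2];

    for (int y = 0; y < HEIGHT; ++y)
      delete[] Jagged[y];
    delete[] Jagged;
    return Image[1][2];
  }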

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

It's great to have a discussion on what such benchmarks should look like.

Instead of one-size-fits-all, should we have multiple problem sizes? There is already SMALL_DATASET, which is smaller than the default, but what about larger ones? SPEC has "test" (should execute everything at least once; great for checking correctness), "train" (for PGO training), and "ref" (the scored benchmark input; in CPU 2017 it runs up to 2 hours). Polybench has MINI_DATASET to EXTRALARGE_DATASET, which are defined by working-set size instead of purpose or runtime.

  • First: We have to choose a default problem size and I just wanted to emphasize that it should not be too big, so running the llvm test-suite finishes in a reasonable timeframe.
  • As I mentioned in another part of my review: I'd recommend writing the benchmarks in a way that they take a single number as a command line argument and then scale the problem size based on that number. We don't really have infrastructure for that today (I consider SMALL_DATASET a super crude tool and would rather not extend it with more variants...), but this style seems easy enough to implement to me and should allow scaling the input size up and down in case someone comes around writing better infrastructure.

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

I agree that studying existing sources, why they are not optimized, and improving the optimizer so that the reason is no longer an obstacle, is the primary goal.
The problem is that this is not feasible for all sources. For instance, the array-of-pointers-to-arrays style (which @proton unfortunately also used here; I call them "jagged arrays", although they are not necessarily jagged) cannot be optimized because the pointers may overlap. Either the frontend language has to ensure that this never happens, or each pointer has to be compared pairwise for aliasing. The first is not the case in C++ (without extensions such as the restrict keyword), and the latter involves a super-constant overhead. Unfortunately, very many benchmarks use jagged arrays.
Second, even if it is possible to remove an optimization obstacle, I would like to know whether it is worth it.
Third, researchers in the field of polyhedral optimization work on improving the optimizer algorithm and ignore language-level details (e.g., whether jagged, row-major, or column-major arrays are used).

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.

The Halide benchmarks are special in many regards; for instance, they work only on x86.

  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.

At some point a problem size has to be arbitrarily defined. What kind of explanation do you expect?

  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

+1

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application running-time differences at around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part so we can run it more often). It's often better to collect statistics over multiple runs than over a single longer run, regardless.

Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.

IMHO we also want to see effects on working-set sizes larger than the last-level cache. A micro-benchmark is great for small working sets, but I am not sure whether Google's benchmark library works well with longer-running ones.

I don't see why it wouldn't work on longer-running kernels. Nevertheless, modern machines have bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).

Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts doing that, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then, until then, the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'd need to change the test. One of the reasons that I like the Google benchmark library is that it dynamically adjusts the number of iterations, thus essentially changing this for us as needed.

It's great to have a discussion on what such benchmarks should look like.

Instead of one-size-fits-all, should we have multiple problem sizes? There is already SMALL_DATASET, which is smaller than the default, but what about larger ones? SPEC has "test" (should execute everything at least once; great for checking correctness), "train" (for PGO training), and "ref" (the scored benchmark input; in CPU 2017 it runs up to 2 hours). Polybench has MINI_DATASET to EXTRALARGE_DATASET, which are defined by working-set size instead of purpose or runtime.

We already have a SMALL_PROBLEM_SIZE setting. I don't think there's anything preventing us from adding other ones, although it's not clear to me how often they'd be used.

Should we embed the kernels in a framework such as Google's, provided that it handles long runtimes and verifies correctness of the result?

I don't recall if it has a way to validate correctness.

First, let me set the record straight:

  • Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
  • There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
  • Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.

I agree that studying existing sources, why they are not optimized, and improving the optimizer so that the reason is no longer an obstacle, is the primary goal.
The problem is that this is not feasible for all sources. For instance, the array-of-pointers-to-arrays style (which @proton unfortunately also used here; I call them "jagged arrays", although they are not necessarily jagged) cannot be optimized because the pointers may overlap. Either the frontend language has to ensure that this never happens, or each pointer has to be compared pairwise for aliasing. The first is not the case in C++ (without extensions such as the restrict keyword), and the latter involves a super-constant overhead. Unfortunately, very many benchmarks use jagged arrays.
Second, even if it is possible to remove an optimization obstacle, I would like to know whether it is worth it.
Third, researchers in the field of polyhedral optimization work on improving the optimizer algorithm and ignore language-level details (e.g., whether jagged, row-major, or column-major arrays are used).

Regarding these (and other new) benchmarks:

  • Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.

The Halide benchmarks are special in many regards; for instance, they work only on x86.

  • Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.

At some point a problem size has to be arbitrarily defined. What kind of explanation do you expect?

  • I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

+1

Are you writing these from scratch? If so, I'd like to make some suggestions:

  • Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.

But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us see the performance improvement after applying optimizations.

The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application running-time differences at around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part so we can run it more often). It's often better to collect statistics over multiple runs than over a single longer run, regardless.

Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.

IMHO we also want to see effects on working-set sizes larger than the last-level cache. A micro-benchmark is great for small working sets, but I am not sure whether Google's benchmark library works well with longer-running ones.

I don't see why it wouldn't work on longer-running kernels. Nevertheless, modern machines have bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).

Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts doing that, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then, until then, the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'd need to change the test. One of the reasons that I like the Google benchmark library is that it dynamically adjusts the number of iterations, thus essentially changing this for us as needed.

True, Google benchmark solves the timing problems nicely.

BTW: Does someone have experience with how stable/well it works to evaluate code size for microbenchmarks, or with pointing profiling tools at them?

I don't see why it wouldn't work on longer-running kernels.

It might run the kernel just once. That is, we only get results from a cold cache.

Nevertheless, modern machines have bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.

High-complexity algorithms such as naive matrix determinant may require more time for problems larger than the last-level cache.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).

Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts doing that, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then, until then, the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'd need to change the test. One of the reasons that I like the Google benchmark library is that it dynamically adjusts the number of iterations, thus essentially changing this for us as needed.

If someone enables auto-parallelization, they probably should leave (at least some) cores available.
For continuous integration, correctness is much more important, so such bots would run with a safe problem size (in the sense that a missed optimization still executes in reasonable time). For dedicated benchmarking, we should select a larger problem size.
That is, the default configuration can be "safe" while having larger problem sizes (including just running more often, like Google benchmark does) for different situations.

I don't see why it wouldn't work on longer-running kernels.

It might run the kernel just once. That is, we only get results from a cold cache.

I don't believe that it will run just once. There's a minimum number of iterations (in part, as I understand it, because it needs to get an estimate of the variance).

Nevertheless, modern machines have bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.

High-complexity algorithms such as naive matrix determinant may require more time for problems larger than the last-level cache.

Granted.

Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overhead.

There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).

Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts doing that, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then, until then, the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'd need to change the test. One of the reasons that I like the Google benchmark library is that it dynamically adjusts the number of iterations, thus essentially changing this for us as needed.

If someone enables auto-parallelization, they probably should leave (at least some) cores available.

This doesn't just come up in that context. There are plenty of codes which use OpenMP or some threading-enabled library.

For continuous integration, correctness is much more important, so such bots would run with a safe problem size (in the sense that a missed optimization still executes in reasonable time). For dedicated benchmarking, we should select a larger problem size.
That is, the default configuration can be "safe" while having larger problem sizes (including just running more often, like Google benchmark does) for different situations.

I'm not talking about CI for correctness, although we should obviously do that too, but about doing regular performance monitoring.