This is an archive of the discontinued LLVM Phabricator instance.

[MicroBenchmarks] Add initial loop vectorization benchmarks.
ClosedPublic

Authored by fhahn on May 4 2021, 9:25 AM.

Details

Summary

This patch adds initial micro-benchmarks with interesting
loop-vectorization cases. To start with, it includes benchmarks using
libm math functions.

For each math function, there's a benchmark for the auto-vectorized
version and a version with vectorization disabled.

The auto-vec version of the benchmark also compares the results of the
auto-vectorized functions to the scalar versions.
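For illustration, a minimal sketch of the shape such a benchmark pair can take with Google Benchmark, using a Clang pragma to disable vectorization for the scalar variant. All names here (BM_sinf_autovec, BM_sinf_novec, init, N) are illustrative, not the patch's actual code:

  #include <benchmark/benchmark.h>
  #include <cmath>
  #include <memory>

  static constexpr int N = 10000;

  static void init(float *A) {
    for (int i = 0; i < N; i++)
      A[i] = static_cast<float>(i) / N;
  }

  // Vectorization enabled: the inner loop is a candidate for the loop
  // vectorizer to turn the scalar sin calls into a vector math routine.
  static void BM_sinf_autovec(benchmark::State &state) {
    std::unique_ptr<float[]> A(new float[N]), B(new float[N]);
    init(A.get());
    for (auto _ : state) {
      for (int i = 0; i < N; i++)
        B[i] = std::sin(A[i]);
      benchmark::DoNotOptimize(B.get());
    }
  }
  BENCHMARK(BM_sinf_autovec);

  // The same loop with the vectorizer disabled, for comparison.
  static void BM_sinf_novec(benchmark::State &state) {
    std::unique_ptr<float[]> A(new float[N]), B(new float[N]);
    init(A.get());
    for (auto _ : state) {
  #pragma clang loop vectorize(disable)
      for (int i = 0; i < N; i++)
        B[i] = std::sin(A[i]);
      benchmark::DoNotOptimize(B.get());
    }
  }
  BENCHMARK(BM_sinf_novec);

  BENCHMARK_MAIN();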

Event Timeline

fhahn created this revision. May 4 2021, 9:25 AM
fhahn requested review of this revision. May 4 2021, 9:25 AM

Do you have an example output? How long does it take to execute the benchmark?

The validation uses a floating-point comparison using ==; could you add a comment that this is intended? I am a bit worried that if I compile the test-suite with CMAKE_CXX_FLAGS=-ffast-math (mostly to measure performance), this will fail.

The comparison is part of the time measurements. This doesn't seem useful to me for benchmarking the vectorized and non-vectorized versions together. If the only thing we are interested in is correctness, we don't need Google Benchmark. Use Google Test instead?

MicroBenchmarks/LoopVectorization/MathFunctions.cpp
56

I think the Inf case is already tested with !=; only NaN needs to be compared separately, since NaN != NaN.
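A tiny standalone example of those IEEE-754 semantics (note that -ffast-math removes these guarantees, which is exactly the concern raised above):

  #include <cassert>
  #include <cmath>

  int main() {
    float inf = INFINITY, nan = NAN;
    assert(inf == inf);    // Inf compares equal to itself, so != catches real mismatches
    assert(!(nan == nan)); // NaN compares unequal to everything, including itself...
    assert(nan != nan);    // ...so two matching NaN results would still trip a != check
    return 0;
  }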

The comparison is part of the time measurements. This doesn't seem useful to me for benchmarking the vectorized and non-vectorized versions together.

My understanding was that googlebench only measures timing in the for (auto _ : state) loop, and that the setup before / after isn't counted.
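For illustration, a sketch of that timing scope; makeInput is a hypothetical helper, not from the patch:

  #include <benchmark/benchmark.h>
  #include <cassert>
  #include <cmath>
  #include <cstddef>
  #include <vector>

  static std::vector<float> makeInput() {    // hypothetical setup helper
    return std::vector<float>(1024, 0.5f);
  }

  static void BM_timing_scope(benchmark::State &state) {
    std::vector<float> in = makeInput();     // setup: runs before timing starts
    std::vector<float> out(in.size());
    for (auto _ : state) {                   // only this loop is measured
      for (std::size_t i = 0; i < in.size(); i++)
        out[i] = std::sin(in[i]);
      benchmark::DoNotOptimize(out.data());
    }
    for (float v : out)                      // verification after the loop: not timed
      assert(!std::isnan(v));
  }
  BENCHMARK(BM_timing_scope);
  BENCHMARK_MAIN();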

lebedev.ri added inline comments.
MicroBenchmarks/LoopVectorization/MathFunctions.cpp
55–56

I'm not confident that this:

  1. is consistent with the rules
  2. will withstand whatever compiler optimization level the test-suite is compiled with

Perhaps you want to be on the safe side and do fpclassify(),
fail if they mismatch, and compare value-equality if they are normal/subnormal?
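For illustration, a sketch of that suggested check, using fpclassify from <cmath>; assertMatch is a hypothetical name, not from the patch:

  #include <cmath>
  #include <cstdio>
  #include <cstdlib>

  // Fail on any floating-point class mismatch; only insist on exact value
  // equality when both results are normal or subnormal numbers.
  template <typename T>
  static void assertMatch(T Reference, T Value) {
    int Class = std::fpclassify(Reference);
    if (Class != std::fpclassify(Value)) {
      std::fprintf(stderr, "floating-point class mismatch\n");
      std::exit(1);
    }
    if ((Class == FP_NORMAL || Class == FP_SUBNORMAL) && Reference != Value) {
      std::fprintf(stderr, "value mismatch\n");
      std::exit(1);
    }
  }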

The comparison is part of the time measurements. This doesn't seem useful to me for benchmarking the vectorized and non-vectorized versions together.

My understanding was that googlebench only measures timing in the for (auto _ : state) loop, and that the setup before / after isn't counted.

Correct.

My understanding was that googlebench only measures timing in the for (auto _ : state) loop, and that the setup before / after isn't counted.

Correct; I misread the call structure thinking that run_fn_autovec was doing the comparison.

fhahn updated this revision to Diff 344347. May 11 2021, 3:56 AM

Updated to fall back to fpclassify if there's a value mis-compare, and disable verification with -DTEST_SUITE_BENCHMARKING_ONLY=On or -ffast-math.

Do you have an example output? How long does it take to execute the benchmark?

Unfortunately I cannot share any detailed absolute numbers about runtime. Also, the total runtime will depend on how long it takes for google-benchmark to determine that the results are stable.

The validation uses a floating-point comparison using ==, could you add a comment that this is intended? I am a bit worried that if I compile the test-suite with CMAKE_CXX_FLAGS=-ffast-math (mostly to measure performance), this will fail.

That's a good point! I adapted the code to use fpclassify as suggested by @lebedev.ri. In addition, I tried to disable verification if either -DTEST_SUITE_BENCHMARKING_ONLY=On is used or -ffast-math is passed. I hope that's a solid enough start, as I think the verification is quite useful in the general case (without -ffast-math).
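For illustration, one way such a guard can look: Clang and GCC pre-define __FAST_MATH__ when -ffast-math is enabled, while the macro forwarded from -DTEST_SUITE_BENCHMARKING_ONLY=On is shown here under an assumed name, not necessarily what the patch defines:

  #if defined(__FAST_MATH__) || defined(BENCHMARKING_ONLY)
  #define VERIFY_RESULTS 0  // results may legitimately differ; skip checking
  #else
  #define VERIFY_RESULTS 1
  #endif

  // later, after running the auto-vectorized version:
  #if VERIFY_RESULTS
    // compare the auto-vectorized results against the scalar reference
  #endif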

fhahn added a comment. May 11 2021, 4:01 AM
This comment was removed by fhahn.
MicroBenchmarks/LoopVectorization/MathFunctions.cpp
55–56

Thanks, I updated the patch to fall back to fpclassify if there's a mismatch.

consistent with the rules

Do you mean rules for the test-suite?

will withstand whatever compiler optimization level the test-suite is compiled with

I think at the moment the only issue could be that the compiler picks a different vector math function with -ffast-math, or that the user specifies a vector library that does not guarantee the same results as the scalar version.

The -ffast-math case should be handled in the latest version, but the user-specified vector library case is not. Not sure what we can do about that case, and how/if other tests deal with that issue.

lebedev.ri accepted this revision. May 11 2021, 4:09 AM

LGTM unless others have other comments

MicroBenchmarks/LoopVectorization/MathFunctions.cpp
11

Sure you don't want the counter to be int?

65

Not really sure if DoNotOptimize is enough; maybe you also want ClobberMemory.
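For reference, the distinction (both are standard Google Benchmark facilities): DoNotOptimize forces a value to be materialized, while ClobberMemory is a compiler-level barrier that keeps preceding stores from being elided or sunk out of the timed loop. A minimal sketch:

  #include <benchmark/benchmark.h>
  #include <vector>

  static void BM_store(benchmark::State &state) {
    std::vector<int> v(64);
    for (auto _ : state) {
      v[0] = 42;
      benchmark::DoNotOptimize(v.data()); // materialize the pointer...
      benchmark::ClobberMemory();         // ...and force the store through it to count
    }
  }
  BENCHMARK(BM_store);
  BENCHMARK_MAIN();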

This revision is now accepted and ready to land. May 11 2021, 4:09 AM

Unfortunately I cannot share any detailed absolute numbers about runtime. Also, the total runtime will depend on how long it takes for google-benchmark to determine that the results are stable.

I am mainly asking because in the past we mandated that the execution time per program remains below 10 seconds, to limit the execution time of the test-suite and not give disproportionate weight to single benchmarks. Of course the execution time varies depending on the platform, like the other Google benchmarks already in the test-suite, but this shouldn't stop us from giving a ballpark figure for typical execution times, as we did for others (e.g. D36582, D43319, D47675, D49503).

fhahn added a comment. May 11 2021, 9:49 AM

Unfortunately I cannot share any detailed absolute numbers about runtime. Also, the total runtime will depend on how long it takes for google-benchmark to determine that the results are stable.

I am mainly asking because in the past we mandated that the execution time per program remains below 10 seconds, to limit the execution time of the test-suite and not give disproportionate weight to single benchmarks. Of course the execution time varies depending on the platform, like the other Google benchmarks already in the test-suite, but this shouldn't stop us from giving a ballpark figure for typical execution times, as we did for others (e.g. D36582, D43319, D47675, D49503).

Ah, I see. I think the total runtime of the binary is probably more than 10s, and closer to a low single-digit number of minutes at most. Each benchmark applies the math function to 100000 elements, which should still be fairly quick per benchmark.

In that case, either only enable it in benchmark mode or reduce the size to run below 10 seconds.

I had a discussion about this with @chandlerc, and minutes-long execution times would definitely be too long for him.

fhahn added a comment. May 13 2021, 1:30 AM

In that case, either only enable it in benchmark mode or reduce the size to run below 10 seconds.

I had a discussion about this with @chandlerc, and minutes-long execution times would definitely be too long for him.

I understand the motivation, but this seems directly opposed to how google-benchmark works in general: one binary with lots of benchmarks, which as a consequence means the binary takes more time, even though each individual benchmark is much shorter. I guess I can disable the benchmarking part if TEST_SUITE_RUN_TYPE=test/train or TEST_SUITE_BENCHMARKING_ONLY is set.

As said, we already had discussions about the point of limiting execution time. I was arguing that benchmarks for L3 memory effects have to be sufficiently large. This isn't even the case here. Why not just decrease N? Whether each single math function counts as its own benchmark doesn't make that much of a difference here.

I can disable the benchmarking part if [...] TEST_SUITE_BENCHMARKING_ONLY

That is what I was suggesting.

As said, we already had discussions about the point of limiting execution time. I was arguing that benchmarks for L3 memory effects have to be sufficiently large. This isn't even the case here. Why not just decrease N?

That isn't going to affect the total runtime of this executable.

Whether each single math function counts as its own benchmark doesn't make that much of a difference here.

I can disable the benchmarking part if [...] TEST_SUITE_BENCHMARKING_ONLY

That is what I was suggesting.

I tried out the patch myself. It was consistently completing in about 37s, i.e. about one second per benchmark.

Number of benchmarks: 36.
Google Benchmark MinTime default: 0.5s
Google Benchmark maximum walltime: 2.5s
That is, the total runtime ranges between 36 × 0.5 s = 18 s and 36 × 2.5 s = 90 s (1.5 minutes).

Reading the earlier report of minutes-long benchmark times, it seemed that each iteration took longer than the 2.5s maximum wall-clock default. In that case, Google Benchmark is unable to gather stable statistics, possibly because a single iteration exceeds the 2.5s budget. This seems to be a good argument for reducing N, giving Google Benchmark more leeway for statistics. In contrast to e.g. LoopInterchange, there is no need for the working set to be large enough to make the cache hierarchy count. Even on my system (Intel x86_64), it only does about 100 iterations per benchmark, which seems low. Note that autovec and novec are approximately equally fast. -ffast-math finishes without error.

There is precedent with MemFunctions, which also runs an even larger number of micro-benchmarks not protected by TEST_SUITE_BENCHMARKING_ONLY, so it might not be necessary here either.

MicroBenchmarks/LoopVectorization/MathFunctions.cpp
39โ€“41
/home/meinersbur/src/llvm-test-suite/MicroBenchmarks/LoopVectorization/MathFunctions.cpp:39:8: error: no member named 'unique_ptr' in namespace 'std'
  std::unique_ptr<T[]> A(new T[N]);
  ~~~~~^

(with libstdc++)

#include <memory> should be added.

fhahn updated this revision to Diff 345858. May 17 2021, 6:35 AM

Update to reduce N, add the missing include, and use ClobberMemory.

I tried out the patch myself. It was consistently completing in about 37s, i.e. about one second per benchmark.

Thanks for doing a run!

Number of benchmarks: 36.
Google Benchmark MinTime default: 0.5s
Google Benchmark maximum walltime: 2.5s
That is, the total runtime ranges between 36 × 0.5 s = 18 s and 36 × 2.5 s = 90 s (1.5 minutes).

Reading the earlier report of minutes-long benchmark times, it seemed that each iteration took longer than the 2.5s maximum wall-clock default. In that case, Google Benchmark is unable to gather stable statistics, possibly because a single iteration exceeds the 2.5s budget. This seems to be a good argument for reducing N, giving Google Benchmark more leeway for statistics. In contrast to e.g. LoopInterchange, there is no need for the working set to be large enough to make the cache hierarchy count. Even on my system (Intel x86_64), it only does about 100 iterations per benchmark, which seems low. Note that autovec and novec are approximately equally fast. -ffast-math finishes without error.

I removed one 0 from N, so now we should get plenty of iterations even on less powerful systems. Without a vector library, the runtimes of the autovec and novec versions are indeed similar. My motivation for the benchmarks is to evaluate and track the performance of various vector libraries.

There is precedent with MemFunctions, which also runs an even larger number of micro-benchmarks not protected by TEST_SUITE_BENCHMARKING_ONLY, so it might not be necessary here either.

Sounds good, I'll go without the extra protection for now. We can adjust if this becomes a problem. In the future I am planning to add other vectorization benchmarks, unrelated to math functions. Perhaps we could add separate binaries for different groups.

Meinersbur accepted this revision. May 17 2021, 8:01 AM

LGTM, thank you.

This revision was automatically updated to reflect the committed changes.