This is an archive of the discontinued LLVM Phabricator instance.

[MicroBenchmarks] Add initial loop vectorization benchmarks.
ClosedPublic

Authored by fhahn on May 4 2021, 9:25 AM.

Details

Summary

This patch adds initial micro-benchmarks with interesting
loop-vectorization cases. To start with, it includes benchmarks using
libm math functions.

For each math function, there's a benchmark for the auto-vectorized
version and a version with vectorization disabled.

The auto-vec version of the benchmark also compares the results of the
auto-vectorized functions to the scalar versions.
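For illustration, a minimal sketch of the shape such a benchmark pair can take with Google Benchmark, using a Clang pragma to disable vectorization for the scalar variant. All names here (BM_sinf_autovec, BM_sinf_novec, init, N) are illustrative, not the patch's actual code:

  #include <benchmark/benchmark.h>
  #include <cmath>
  #include <memory>

  static constexpr int N = 10000;

  static void init(float *A) {
    for (int i = 0; i < N; i++)
      A[i] = static_cast<float>(i) / N;
  }

  // Vectorization enabled: the inner loop is a candidate for the loop
  // vectorizer to turn the scalar sin calls into a vector math routine.
  static void BM_sinf_autovec(benchmark::State &state) {
    std::unique_ptr<float[]> A(new float[N]), B(new float[N]);
    init(A.get());
    for (auto _ : state) {
      for (int i = 0; i < N; i++)
        B[i] = std::sin(A[i]);
      benchmark::DoNotOptimize(B.get());
    }
  }
  BENCHMARK(BM_sinf_autovec);

  // The same loop with the vectorizer disabled, for comparison.
  static void BM_sinf_novec(benchmark::State &state) {
    std::unique_ptr<float[]> A(new float[N]), B(new float[N]);
    init(A.get());
    for (auto _ : state) {
  #pragma clang loop vectorize(disable)
      for (int i = 0; i < N; i++)
        B[i] = std::sin(A[i]);
      benchmark::DoNotOptimize(B.get());
    }
  }
  BENCHMARK(BM_sinf_novec);

  BENCHMARK_MAIN();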

Event Timeline

fhahn created this revision. May 4 2021, 9:25 AM
fhahn requested review of this revision. May 4 2021, 9:25 AM

Do you have an example output? How long does it take to execute the benchmark?

The validation uses a floating-point comparison using ==; could you add a comment that this is intended? I am a bit worried that if I compile the test-suite with CMAKE_CXX_FLAGS=-ffast-math (mostly to measure performance), this will fail.

The comparison is part of the time measurements. This doesn't seem useful to me for benchmarking the vectorized and non-vectorized versions together. If the only thing we are interested in is correctness, we don't need Google Benchmark. Use Google Test instead?

MicroBenchmarks/LoopVectorization/MathFunctions.cpp
56

I think the Inf case is already tested with !=; only NaN needs to be compared separately, since NaN != NaN.
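A tiny standalone example of those IEEE-754 semantics (note that -ffast-math removes these guarantees, which is exactly the concern raised above):

  #include <cassert>
  #include <cmath>

  int main() {
    float inf = INFINITY, nan = NAN;
    assert(inf == inf);    // Inf compares equal to itself, so != catches real mismatches
    assert(!(nan == nan)); // NaN compares unequal to everything, including itself...
    assert(nan != nan);    // ...so two matching NaN results would still trip a != check
    return 0;
  }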

The comparison is part of the time measurements. This doesn't seem useful to me for benchmarking the vectorized and non-vectorized versions together.

My understanding was that googlebench only measures timing in the for (auto _ : state) loop, and that the setup before / after isn't counted.
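For illustration, a sketch of that timing scope; makeInput is a hypothetical helper, not from the patch:

  #include <benchmark/benchmark.h>
  #include <cassert>
  #include <cmath>
  #include <cstddef>
  #include <vector>

  static std::vector<float> makeInput() {    // hypothetical setup helper
    return std::vector<float>(1024, 0.5f);
  }

  static void BM_timing_scope(benchmark::State &state) {
    std::vector<float> in = makeInput();     // setup: runs before timing starts
    std::vector<float> out(in.size());
    for (auto _ : state) {                   // only this loop is measured
      for (std::size_t i = 0; i < in.size(); i++)
        out[i] = std::sin(in[i]);
      benchmark::DoNotOptimize(out.data());
    }
    for (float v : out)                      // verification after the loop: not timed
      assert(!std::isnan(v));
  }
  BENCHMARK(BM_timing_scope);
  BENCHMARK_MAIN();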

lebedev.ri added inline comments.
MicroBenchmarks/LoopVectorization/MathFunctions.cpp
55–56

I'm not confident that this:

  1. is consistent with the rules
  2. will withstand whatever compiler optimization level the test-suite is compiled with

Perhaps you want to be on the safe side and do fpclassify(),
fail if they mismatch, and compare value-equality if they are normal/subnormal?
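For illustration, a sketch of that suggested check, using fpclassify from <cmath>; assertMatch is a hypothetical name, not from the patch:

  #include <cmath>
  #include <cstdio>
  #include <cstdlib>

  // Fail on any floating-point class mismatch; only insist on exact value
  // equality when both results are normal or subnormal numbers.
  template <typename T>
  static void assertMatch(T Reference, T Value) {
    int Class = std::fpclassify(Reference);
    if (Class != std::fpclassify(Value)) {
      std::fprintf(stderr, "floating-point class mismatch\n");
      std::exit(1);
    }
    if ((Class == FP_NORMAL || Class == FP_SUBNORMAL) && Reference != Value) {
      std::fprintf(stderr, "value mismatch\n");
      std::exit(1);
    }
  }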

The comparison is part of the time measurements. This doesn't seem useful to me for benchmarking the vectorized and non-vectorized versions together.

My understanding was that googlebench only measures timing in the for (auto _ : state) loop, and that the setup before / after isn't counted.

Correct.

My understanding was that googlebench only measures timing in the for (auto _ : state) loop, and that the setup before / after isn't counted.

Correct; I misread the call structure thinking that run_fn_autovec was doing the comparison.

fhahn updated this revision to Diff 344347. May 11 2021, 3:56 AM

Updated to fall back to fpclassify if there's a value mis-compare, and disable verification with -DTEST_SUITE_BENCHMARKING_ONLY=On or -ffast-math.

Do you have an example output? How long does it take to execute the benchmark?

Unfortunately I cannot share any detailed absolute numbers about runtime. Also, the total runtime will depend on how long it takes for google-benchmark to determine that the results are stable.

The validation uses a floating-point comparison using ==, could you add a comment that this is intended? I am a bit worried that if I compile the test-suite with CMAKE_CXX_FLAGS=-ffast-math (mostly to measure performance), this will fail.

That's a good point! I adapted the code to use fpclassify as suggested by @lebedev.ri. In addition, I tried to disable verification if either -DTEST_SUITE_BENCHMARKING_ONLY=On is used or -ffast-math is passed. I hope that's a solid enough start, as I think the verification is quite useful in the general case (without -ffast-math).
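For illustration, one way such a guard can look: Clang and GCC pre-define __FAST_MATH__ when -ffast-math is enabled, while the macro forwarded from -DTEST_SUITE_BENCHMARKING_ONLY=On is shown here under an assumed name, not necessarily what the patch defines:

  #if defined(__FAST_MATH__) || defined(BENCHMARKING_ONLY)
  #define VERIFY_RESULTS 0  // results may legitimately differ; skip checking
  #else
  #define VERIFY_RESULTS 1
  #endif

  // later, after running the auto-vectorized version:
  #if VERIFY_RESULTS
    // compare the auto-vectorized results against the scalar reference
  #endif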

fhahn added a comment. May 11 2021, 4:01 AM
This comment was removed by fhahn.
MicroBenchmarks/LoopVectorization/MathFunctions.cpp
55–56

Thanks, I updated the patch to fall back to fpclassify if there's a mismatch.

consistent with the rules

Do you mean rules for the test-suite?

will withstand whatever compiler optimization level the test-suite is compiled with

I think at the moment the only issue could be that the compiler picks a different vector math function with -ffast-math, or that the user specifies a vector library that does not guarantee the same results as the scalar version.

The -ffast-math case should be handled in the latest version, but the user-specified vector library case is not. Not sure what we can do about that case, and how/if other tests deal with that issue.

lebedev.ri accepted this revision. May 11 2021, 4:09 AM

LGTM unless others have other comments

MicroBenchmarks/LoopVectorization/MathFunctions.cpp
11

Sure you don't want the counter to be int?

65

Not really sure if DoNotOptimize is enough; maybe you also want ClobberMemory.
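For reference, the distinction (both are standard Google Benchmark facilities): DoNotOptimize forces a value to be materialized, while ClobberMemory is a compiler-level barrier that keeps preceding stores from being elided or sunk out of the timed loop. A minimal sketch:

  #include <benchmark/benchmark.h>
  #include <vector>

  static void BM_store(benchmark::State &state) {
    std::vector<int> v(64);
    for (auto _ : state) {
      v[0] = 42;
      benchmark::DoNotOptimize(v.data()); // materialize the pointer...
      benchmark::ClobberMemory();         // ...and force the store through it to count
    }
  }
  BENCHMARK(BM_store);
  BENCHMARK_MAIN();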

This revision is now accepted and ready to land. May 11 2021, 4:09 AM

Unfortunately I cannot share any detailed absolute numbers about runtime. Also, the total runtime will depend on how long it takes for google-benchmark to determine that the results are stable.

I am mainly asking because in the past we mandated that the execution time per program remains below 10 seconds, to limit the execution time of the test-suite and not give disproportionate weight to single benchmarks. Of course the execution time varies depending on the platform, like the other Google benchmarks already in the test-suite, but this shouldn't stop us from giving a ballpark figure for typical execution times, as we did for others (e.g. D36582, D43319, D47675, D49503).

fhahn added a comment. May 11 2021, 9:49 AM

Unfortunately I cannot share any detailed absolute numbers about runtime. Also, the total runtime will depend on how long it takes for google-benchmark to determine that the results are stable.

I am mainly asking because in the past we mandated that the execution time per program remains below 10 seconds, to limit the execution time of the test-suite and not give disproportionate weight to single benchmarks. Of course the execution time varies depending on the platform, like the other Google benchmarks already in the test-suite, but this shouldn't stop us from giving a ballpark figure for typical execution times, as we did for others (e.g. D36582, D43319, D47675, D49503).

Ah, I see. I think the total runtime of the binary is probably more than 10s, and closer to a low single-digit number of minutes at most. Each benchmark applies the math function to 100000 elements, which should still be fairly quick per benchmark.

In that case, either only enable it in benchmark mode or reduce the size to run below 10 seconds.

I had a discussion about this with @chandlerc, and minutes-long execution times would definitely be too long for him.

fhahn added a comment. May 13 2021, 1:30 AM

In that case, either only enable it in benchmark mode or reduce the size to run below 10 seconds.

I had a discussion about this with @chandlerc, and minutes-long execution times would definitely be too long for him.

I understand the motivation, but this seems directly opposed to how google-benchmark works in general: one binary with lots of benchmarks, which as a consequence means the binary takes more time, even though each individual benchmark is much shorter. I guess I can disable the benchmarking part if TEST_SUITE_RUN_TYPE=test/train or TEST_SUITE_BENCHMARKING_ONLY is set.

As said, we already had discussions about the point of limiting execution time. I was arguing that benchmarks for L3 memory effects have to be sufficiently large. This isn't even the case here. Why not just decrease N? Whether each single math function counts as its own benchmark doesn't make that much of a difference here.

I can disable the benchmarking part if [...] TEST_SUITE_BENCHMARKING_ONLY

That is what I was suggesting.

As said, we already had discussions about the point of limiting execution time. I was arguing that benchmarks for L3 memory effects have to be sufficiently large. This isn't even the case here. Why not just decrease N?

That isn't going to affect the total runtime of this executable.

Whether each single math function counts as its own benchmark doesn't make that much of a difference here.

I can disable the benchmarking part if [...] TEST_SUITE_BENCHMARKING_ONLY

That is what I was suggesting.

I tried out the patch myself. It was consistently completing in about 37s, i.e. about one second per benchmark.

Number of benchmarks: 36.
Google Benchmark MinTime default: 0.5s
Google Benchmark maximum walltime: 2.5s
That is, the total runtime ranges between 36 × 0.5 s = 18 s and 36 × 2.5 s = 90 s (1.5 minutes).

Reading the earlier report of minutes-long benchmark times, it seemed that each iteration took longer than the 2.5s maximum wall-clock default. In that case, Google Benchmark is unable to gather stable statistics, possibly because a single iteration exceeds the 2.5s budget. This seems to be a good argument for reducing N, giving Google Benchmark more leeway for statistics. In contrast to e.g. LoopInterchange, there is no need for the working set to be large enough to make the cache hierarchy count. Even on my system (Intel x86_64), it only does about 100 iterations per benchmark, which seems low. Note that autovec and novec are approximately equally fast. -ffast-math finishes without error.

There is precedent with MemFunctions, which also runs an even larger number of micro-benchmarks not protected by TEST_SUITE_BENCHMARKING_ONLY, so it might not be necessary here either.

MicroBenchmarks/LoopVectorization/MathFunctions.cpp
39โ€“41
/home/meinersbur/src/llvm-test-suite/MicroBenchmarks/LoopVectorization/MathFunctions.cpp:39:8: error: no member named 'unique_ptr' in namespace 'std'
  std::unique_ptr<T[]> A(new T[N]);
  ~~~~~^

(with libstdc++)

#include <memory> should be added.

fhahn updated this revision to Diff 345858. May 17 2021, 6:35 AM

Update to reduce N, add the missing include, and use ClobberMemory.

I tried out the patch myself. It was consistently completing in about 37s, i.e. about one second per benchmark.

Thanks for doing a run!

Number of benchmarks: 36.
Google Benchmark MinTime default: 0.5s
Google Benchmark maximum walltime: 2.5s
That is, the total runtime ranges between 36 × 0.5 s = 18 s and 36 × 2.5 s = 90 s (1.5 minutes).

Reading the earlier report of minutes-long benchmark times, it seemed that each iteration took longer than the 2.5s maximum wall-clock default. In that case, Google Benchmark is unable to gather stable statistics, possibly because a single iteration exceeds the 2.5s budget. This seems to be a good argument for reducing N, giving Google Benchmark more leeway for statistics. In contrast to e.g. LoopInterchange, there is no need for the working set to be large enough to make the cache hierarchy count. Even on my system (Intel x86_64), it only does about 100 iterations per benchmark, which seems low. Note that autovec and novec are approximately equally fast. -ffast-math finishes without error.

I removed one 0 from N, so now we should get plenty of iterations even on less powerful systems. Without a vector library, the runtimes of the autovec and novec versions are indeed similar. My motivation for the benchmarks is to evaluate and track the performance of various vector libraries.

There is precedent with MemFunctions, which also runs an even larger number of micro-benchmarks not protected by TEST_SUITE_BENCHMARKING_ONLY, so it might not be necessary here either.

Sounds good, I'll go without the extra protection for now. We can adjust if this becomes a problem. In the future I am planning to add other vectorization benchmarks, unrelated to math functions. Perhaps we could add separate binaries for different groups.

Meinersbur accepted this revision. May 17 2021, 8:01 AM

LGTM, thank you.

This revision was automatically updated to reflect the committed changes.