This is an archive of the discontinued LLVM Phabricator instance.

Add first microbenchmarks for matrix types extensions.
Changes Planned · Public

Authored by fhahn on Jul 13 2020, 10:12 AM.

Details

Reviewers

anemet
paquette
LuoYuanke
SjoerdMeijer

Summary

This patch adds an initial set of micro benchmarks for the matrix types
extension.

Diff Detail

Repository

rOLDT svn-test-suite

Build Status

Buildable 63990
Build 79172: arc lint + arc unit

Event Timeline

fhahn created this revision. Jul 13 2020, 10:12 AM

Herald added a subscriber: mgorny. Jul 13 2020, 10:12 AM

Harbormaster completed remote builds in B63990: Diff 277475. Jul 13 2020, 10:12 AM

paquette added inline comments. Jul 13 2020, 10:18 AM

MicroBenchmarks/MatrixTypes/main.cpp
147	Why 15 and 19?

fhahn marked an inline comment as done. Jul 13 2020, 10:27 AM

fhahn added inline comments.

MicroBenchmarks/MatrixTypes/main.cpp
147	No particular reason, it could be 17 and 13 or a similar combination around the 16 element range. The intention for those is to also cover some cases where the number of elements isn't a power-of-2 and more unusual combinations.
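To illustrate why shapes such as 15 and 19 are worth covering, here is a minimal sketch of how a column splits into fixed-width vector chunks. The helper `splitColumn` and the chunk width of 4 elements are assumptions made for illustration; they are not taken from the patch or from any particular lowering.

```cpp
// Hypothetical helper (not part of the patch): describes how a column with
// `Rows` elements splits into vector chunks of `VectorWidth` elements.
struct ChunkSplit {
  unsigned FullChunks; // chunks that fill a whole vector
  unsigned Remainder;  // leftover elements that need separate handling
};

constexpr ChunkSplit splitColumn(unsigned Rows, unsigned VectorWidth) {
  return {Rows / VectorWidth, Rows % VectorWidth};
}

// With an assumed width of 4 elements, a 16-row column divides evenly,
// while 15 and 19 rows both leave 3 elements over.
static_assert(splitColumn(16, 4).Remainder == 0, "power-of-2 case");
static_assert(splitColumn(15, 4).Remainder == 3, "non-power-of-2 case");
static_assert(splitColumn(19, 4).Remainder == 3, "non-power-of-2 case");
```

Under that assumption, the power-of-2 shapes never exercise the leftover path, which is why odd shapes add coverage.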

fhahn mentioned this in D83910: Fix random number generation and floating point comparison in matrix-types-spec.cpp. Jul 16 2020, 1:41 AM

ping

Looks decent as an initial commit to me. Two high level questions:

I haven't looked at these MicroBenchmarks in the test-suite yet, but in general it would be convenient if a benchmark also did a correctness check. Do you think there would be any value in doing that here? If so, would that be easy to add?
In benchmarking, stable numbers are convenient. Since the input is randomly generated, I was wondering if there could be timing differences depending on different inputs? But I guess not here?

In D83692#2163816, @SjoerdMeijer wrote:

Looks decent as an initial commit to me. Two high level questions:

I haven't looked at these MicroBenchmarks in the test-suite yet, but in general it would be convenient if a benchmark also did a correctness check. Do you think there would be any value in doing that here? If so, would that be easy to add?

Agreed, that would indeed be convenient. Let me change that.
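One possible shape for such a check, sketched in plain C++ so that it does not depend on `-fenable-matrix`: compute a naive reference product and compare it element-wise against the benchmarked result. The names `referenceMultiply` and `approxEqual`, the column-major layout, and the relative tolerance are all assumptions for this sketch, not what the patch ended up doing.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive reference product of an R x K matrix A and a K x C matrix B.
// All matrices are stored column-major (an assumption made for this sketch).
template <typename T>
std::vector<T> referenceMultiply(const std::vector<T> &A,
                                 const std::vector<T> &B, std::size_t R,
                                 std::size_t K, std::size_t C) {
  std::vector<T> Out(R * C, T(0));
  for (std::size_t Col = 0; Col < C; ++Col)
    for (std::size_t Row = 0; Row < R; ++Row)
      for (std::size_t I = 0; I < K; ++I)
        Out[Col * R + Row] += A[I * R + Row] * B[Col * K + I];
  return Out;
}

// Element-wise comparison with a relative tolerance, because a vectorized
// multiply may accumulate in a different order than the reference loop.
template <typename T>
bool approxEqual(const std::vector<T> &X, const std::vector<T> &Y, T RelTol) {
  if (X.size() != Y.size())
    return false;
  for (std::size_t I = 0; I < X.size(); ++I) {
    T Scale = std::max(std::fabs(X[I]), std::fabs(Y[I]));
    if (std::fabs(X[I] - Y[I]) > RelTol * std::max(Scale, T(1)))
      return false;
  }
  return true;
}
```

In a benchmark, the comparison would run once outside the timed loop, so it does not affect the measured time.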

In benchmarking, stable numbers are convenient. Since the input is randomly generated, I was wondering if there could be timing differences depending on different inputs? But I guess not here?

I would expect that differences in the actual FP values would not affect the throughput or latency of the floating-point units. I don't think there's anything about that in the public Arm Cortex tuning guides. From what I've seen so far on the devices I have access to, the numbers are relatively stable, although sometimes there are rather large swings for individual benchmarks (like +100% in runtime for a single benchmark). My working theory is that this is due to system noise. If there's a real issue, I think we can address it once it appears.
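A related point on input stability: `initRandom` in this diff default-constructs its `std::default_random_engine`, so the engine starts from the library's fixed default seed and every run sees the same inputs. The sketch below shows that behaviour in isolation; `drawValues` is a hypothetical helper, not a function from the patch.

```cpp
#include <random>
#include <vector>

// Draws N values from a freshly default-constructed engine, mirroring how
// initRandom in the diff sets up its generator.
std::vector<float> drawValues(unsigned N) {
  std::default_random_engine Generator; // default seed, fixed by the library
  std::uniform_real_distribution<float> Distribution;
  std::vector<float> Values;
  Values.reserve(N);
  for (unsigned I = 0; I < N; ++I)
    Values.push_back(Distribution(Generator));
  return Values;
}
```

Two calls to `drawValues(16)` return identical sequences within one standard library implementation. The same reasoning implies that two matrices of the same shape and element type initialized by separate `initRandom` calls receive identical contents.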

Feel free to ignore: "On Subnormal Floating Point and Abnormal Timing"
http://www.ieee-security.org/TC/SP2015/papers-archived/6949a623.pdf

A NaN matrix times a NaN matrix will be slow.
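If input-dependent timing ever became a concern, one cheap guard would be to check the generated inputs for the value classes mentioned here before timing. This is a minimal sketch; `hasSpecialValues` is a hypothetical helper, and whether such values are actually slower depends on the hardware.

```cpp
#include <cmath>
#include <cstddef>

// Returns true if any element is subnormal, NaN, or infinite: value classes
// that some floating-point units may handle on a slower path.
template <typename T>
bool hasSpecialValues(const T *Data, std::size_t N) {
  for (std::size_t I = 0; I < N; ++I) {
    int Class = std::fpclassify(Data[I]);
    if (Class == FP_SUBNORMAL || Class == FP_NAN || Class == FP_INFINITE)
      return true;
  }
  return false;
}
```

A default-constructed `std::uniform_real_distribution` draws from the range 0 to 1, so subnormal inputs are very unlikely with the initialization used in this diff, and NaN or infinite inputs should not occur at all.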

Revision Contents

Path	Size
MicroBenchmarks/
  CMakeLists.txt	1 line
  MatrixTypes/
    CMakeLists.txt	11 lines
    main.cpp	214 lines

Diff 277475

MicroBenchmarks/CMakeLists.txt

 file(COPY lit.local.cfg DESTINATION ${CMAKE_CURRENT_BINARY_DIR})

 add_subdirectory(Builtins)
 add_subdirectory(libs)
 add_subdirectory(XRay)
 add_subdirectory(LCALS)
 add_subdirectory(harris)
 add_subdirectory(ImageProcessing)
 add_subdirectory(LoopInterchange)
+add_subdirectory(MatrixTypes)
 add_subdirectory(MemFunctions)

MicroBenchmarks/MatrixTypes/CMakeLists.txt

This file was added.

				include(CheckCXXCompilerFlag)

				# Enable matrix types extension benchmarks for compilers supporting -fenable-matrix.
				check_cxx_compiler_flag(-fenable-matrix COMPILER_HAS_MATRIX_FLAG)
				if (COMPILER_HAS_MATRIX_FLAG)
				  llvm_test_run(WORKDIR ${CMAKE_CURRENT_BINARY_DIR})

				  set_property(SOURCE main.cpp PROPERTY COMPILE_FLAGS -fenable-matrix)
				  llvm_test_executable(MatrixTypes main.cpp)
				  target_link_libraries(MatrixTypes benchmark)
				endif()

MicroBenchmarks/MatrixTypes/main.cpp

This file was added.

				#include <memory> // std::unique_ptr
				#include <random>
				#include <type_traits>

				#include "benchmark/benchmark.h"

				// Micro benchmarks for the matrix types extensions.

				#if __has_extension(matrix_types)

				namespace {

				template <typename ElementTy, unsigned R, unsigned C>
				using matrix_t = ElementTy __attribute__((matrix_type(R, C)));

				template <typename ElementTy, unsigned R, unsigned C>
				std::unique_ptr<matrix_t<ElementTy, R, C>> allocateMatrix() {
				  return std::unique_ptr<matrix_t<ElementTy, R, C>>(
				      new matrix_t<ElementTy, R, C>);
				}

				template <typename ElementTy, unsigned R, unsigned C,
				          typename std::enable_if_t<std::is_floating_point<ElementTy>::value,
				                                    int> = 0>
				void initRandom(matrix_t<ElementTy, R, C> &M) {
				  std::default_random_engine generator;
				  std::uniform_real_distribution<ElementTy> distribution;

				  for (unsigned I = 0; I < R; I++)
				    for (unsigned J = 0; J < C; J++)
				      M[I][J] = distribution(generator);
				}

				template <
				    typename ElementTy, unsigned R, unsigned C,
				    typename std::enable_if_t<std::is_integral<ElementTy>::value, int> = 0>
				void initRandom(matrix_t<ElementTy, R, C> &M) {
				  std::default_random_engine generator;
				  std::uniform_int_distribution<ElementTy> distribution;

				  for (unsigned I = 0; I < R; I++)
				    for (unsigned J = 0; J < C; J++)
				      M[I][J] = distribution(generator);
				}

				template <typename ElementTy, unsigned R0, unsigned C0, unsigned R1,
				          unsigned C1>
				static void BM_MatrixTypes_Mult(benchmark::State &state) {
				  auto XPtr = allocateMatrix<ElementTy, R0, C0>();
				  auto YPtr = allocateMatrix<ElementTy, R1, C1>();

				  auto ZPtr = allocateMatrix<ElementTy, R0, C1>();

				  matrix_t<ElementTy, R0, C0> &X = *XPtr;
				  matrix_t<ElementTy, R1, C1> &Y = *YPtr;
				  matrix_t<ElementTy, R0, C1> &Z = *ZPtr;

				  initRandom(X);
				  initRandom(Y);
				  for (auto _ : state) {
				    benchmark::DoNotOptimize(XPtr);
				    benchmark::DoNotOptimize(YPtr);
				    benchmark::DoNotOptimize(ZPtr);
				    Z = X * Y;
				  }
				}

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 3, 3, 3, 3);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 3, 3, 3, 3);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, char, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, unsigned, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, long long int, 4, 4, 4, 4);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 3, 2, 2, 5);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 3, 2, 2, 5);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, char, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, unsigned, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, long long int, 8, 8, 8, 8);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 12, 8, 8, 14);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 12, 8, 8, 14);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 15, 19, 19, 15);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 15, 19, 19, 15);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 16, 16, 16, 16);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 16, 16, 16, 16);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 32, 32, 32, 32);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 32, 32, 32, 32);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 48, 48, 48, 48);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 48, 48, 48, 48);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, float, 64, 64, 64, 64);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult, double, 64, 64, 64, 64);

				template <typename ElementTy, unsigned R0, unsigned C0, unsigned R1,
				          unsigned C1>
				static void BM_MatrixTypes_Mult_Transpose(benchmark::State &state) {
				  auto XPtr = allocateMatrix<ElementTy, R0, C0>();
				  auto YPtr = allocateMatrix<ElementTy, R1, C1>();

				  // Y is transposed before multiplying.
				  auto ZPtr = allocateMatrix<ElementTy, R0, R1>();

				  matrix_t<ElementTy, R0, C0> &X = *XPtr;
				  matrix_t<ElementTy, R1, C1> &Y = *YPtr;
				  matrix_t<ElementTy, R0, R1> &Z = *ZPtr;

				  initRandom(X);
				  initRandom(Y);
				  for (auto _ : state) {
				    benchmark::DoNotOptimize(XPtr);
				    benchmark::DoNotOptimize(YPtr);
				    benchmark::DoNotOptimize(ZPtr);
				    Z = X * __builtin_matrix_transpose(Y);
				  }
				}

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, float, 3, 3, 3, 3);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, double, 3, 3, 3, 3);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, char, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, float, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, unsigned, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, double, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, long long int, 4, 4, 4, 4);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, float, 3, 2, 5, 2);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, double, 3, 2, 5, 2);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, char, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, float, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, unsigned, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, double, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, long long int, 8, 8, 8, 8);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, float, 12, 8, 14, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, double, 12, 8, 14, 8);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, float, 15, 19, 15, 19);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, double, 15, 19, 15, 19);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, float, 16, 16, 16, 16);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, double, 16, 16, 16, 16);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, float, 32, 32, 32, 32);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Transpose, double, 32, 32, 32, 32);

				template <typename ElementTy, unsigned R0, unsigned C0, unsigned R1,
				          unsigned C1>
				static void BM_MatrixTypes_Mult_Add(benchmark::State &state) {
				  auto XPtr = allocateMatrix<ElementTy, R0, C0>();
				  auto YPtr = allocateMatrix<ElementTy, R1, C1>();

				  // Y is transposed before multiplying, so the product has R0 rows and R1
				  // columns and matches the accumulator Z.
				  auto ZPtr = allocateMatrix<ElementTy, R0, R1>();

				  matrix_t<ElementTy, R0, C0> &X = *XPtr;
				  matrix_t<ElementTy, R1, C1> &Y = *YPtr;
				  matrix_t<ElementTy, R0, R1> &Z = *ZPtr;

				  initRandom(X);
				  initRandom(Y);
				  initRandom(Z);
				  for (auto _ : state) {
				    benchmark::DoNotOptimize(XPtr);
				    benchmark::DoNotOptimize(YPtr);
				    benchmark::DoNotOptimize(ZPtr);
				    // A plain X * Y does not type-check for the non-square instantiations
				    // below (e.g. 3x2 times 5x2), so Y is transposed as the comment above
				    // describes.
				    Z = Z + X * __builtin_matrix_transpose(Y);
				  }
				}

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, float, 3, 3, 3, 3);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, double, 3, 3, 3, 3);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, char, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, float, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, unsigned, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, double, 4, 4, 4, 4);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, long long int, 4, 4, 4, 4);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, float, 3, 2, 5, 2);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, double, 3, 2, 5, 2);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, char, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, float, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, unsigned, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, double, 8, 8, 8, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, long long int, 8, 8, 8, 8);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, float, 12, 8, 14, 8);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, double, 12, 8, 14, 8);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, float, 15, 19, 15, 19);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, double, 15, 19, 15, 19);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, float, 16, 16, 16, 16);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, double, 16, 16, 16, 16);

				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, float, 32, 32, 32, 32);
				BENCHMARK_TEMPLATE(BM_MatrixTypes_Mult_Add, double, 32, 32, 32, 32);

				} // namespace

				#endif

				BENCHMARK_MAIN();