Details

Reviewers

fhahn
t.p.northover
paquette

Commits

rT08de51078b0a: [MicroBenchmarks,AArch64] Added correctness test & other performance tests for…

Summary

This patch adds a correctness test that checks the results of vectorized truncate and zero-extend operations in a loop for different vector types.
It also adds performance tests for truncate/zero-extend operations combined with addition.

The goal of this benchmark is to measure the impact of the AArch64-specific changes in D133495, D120571, D135229 and D136722.

Diff Detail

Repository

rT test-suite

Build Status

Buildable 199952
Build 303662: arc lint + arc unit

Event Timeline

nilanjana_basu created this revision. Nov 15 2022, 1:13 PM

Herald added a project: Restricted Project. Nov 15 2022, 1:13 PM

Herald added a subscriber: kristof.beyls.

nilanjana_basu requested review of this revision. Nov 15 2022, 1:13 PM

Harbormaster completed remote builds in B197825: Diff 475560. Nov 15 2022, 1:13 PM

Herald added a subscriber: pcwang-thead. Nov 15 2022, 1:13 PM

nilanjana_basu added reviewers: fhahn, t.p.northover, paquette. Nov 15 2022, 1:14 PM

Why randomized?

Removed randomization in input & combined correctness tests with performance ones. Explicitly added vectorization width for 16 elements since the related patches target this width.

Harbormaster completed remote builds in B197860: Diff 475615. Nov 15 2022, 3:50 PM

In D138059#3928627, @paquette wrote:

Why randomized?

You're right, it was unnecessary. Removed it in the latest patch & combined the correctness tests with the performance ones.

In D138059#3928627, @paquette wrote:

Why randomized?

In D138059#3929074, @nilanjana_basu wrote:

Removed randomization in input & combined correctness tests with performance ones. Explicitly added vectorization width for 16 elements since the related patches target this width.

I think the main reason for initializing with random data is to make the benchmarks more robust so the optimizer won't be able to (partly) optimize out our benchmark code?
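
As a minimal sketch of the random initialization being discussed, the helper below fills a buffer with values the compiler cannot know at compile time. The name `makeRandomInput` and the explicit seed parameter are assumptions made for illustration. The sketch draws from a `uint64_t` distribution and narrows the result, because the C++ standard only lists `short`, `int`, `long`, `long long` and their unsigned counterparts as valid `IntType` arguments for `std::uniform_int_distribution`.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <type_traits>
#include <vector>

// Hypothetical helper: returns N pseudo-random values covering the full
// range of the unsigned element type Ty.
template <typename Ty>
std::vector<Ty> makeRandomInput(std::size_t N, std::uint32_t Seed) {
  static_assert(std::is_unsigned<Ty>::value, "sketch assumes unsigned types");
  std::mt19937 Rng(Seed);
  // Draw from a 64-bit distribution bounded to Ty's range, then narrow.
  std::uniform_int_distribution<std::uint64_t> Distrib(
      std::numeric_limits<Ty>::min(), std::numeric_limits<Ty>::max());
  std::vector<Ty> Data(N);
  for (Ty &V : Data)
    V = static_cast<Ty>(Distrib(Rng));
  return Data;
}
```

Using a fixed seed keeps runs reproducible on a given standard library while still preventing constant folding of the loop under test.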

fhahn edited the summary of this revision. Nov 15 2022, 4:00 PM

In D138059#3929083, @fhahn wrote:

In D138059#3928627, @paquette wrote:

Why randomized?

In D138059#3929074, @nilanjana_basu wrote:

Removed randomization in input & combined correctness tests with performance ones. Explicitly added vectorization width for 16 elements since the related patches target this width.

I think the main reason for initializing with random data is to make the benchmarks more robust so the optimizer won't be able to (partly) optimize out our benchmark code?

I checked that at the IR level the generated code had the relevant trunc & zext instructions, 4 times per function for interleave_count 4. I could also see a performance difference based on the related patches it is meant to test. To be on the safe side, what do you think about adding "benchmark::DoNotOptimize(A)" instead? Or should we prefer reverting to the old form?

In D138059#3929188, @nilanjana_basu wrote:

I checked that at the IR level the generated code had the relevant trunc & zext instructions, 4 times per function for interleave_count 4. I could also see a performance difference based on the related patches it is meant to test. To be on the safe side, what do you think about adding "benchmark::DoNotOptimize(A)" instead? Or should we prefer reverting to the old form?

Yeah, it is probably fine now, but testing with a single value also seems to make the test less interesting. You could keep the random initialization and add a version of truncOrZextVecInLoopWithVW8 that disables vectorization to generate comparison data for testing.

Reverted to using random inputs & changed correctness test to compare against same operations with no vectorization

Harbormaster completed remote builds in B198611: Diff 476668. Nov 19 2022, 12:53 AM

In D138059#3935023, @fhahn wrote:

Yeah, it is probably fine now, but testing with a single value also seems to make the test less interesting. You could keep the random initialization and add a version of truncOrZextVecInLoopWithVW8 that disables vectorization to generate comparison data for testing.

Yes, that sounds like a better correctness test. Have updated it.

MicroBenchmarks/LoopVectorization/VectorOperations.cpp
22	Added noinline for ease of verifying the IR to check if this part is not optimized out. Can be removed if deemed unnecessary.

fhahn added inline comments. Nov 22 2022, 3:18 AM

MicroBenchmarks/LoopVectorization/VectorOperations.cpp
22	it might confuse readers, so it's better to remove it I think
23	LLVM coding style uses uppercase for variables
67	it would also be good to benchmark a version of the loop without any pragmas. Also, why fix the VF?
141	It might be better to move this before the main loop so we fail early.
159–160	we also need versions with different dst types than i8. Same for the src types for zexts

Addressed reviewer comments

Harbormaster completed remote builds in B199951: Diff 478478. Nov 29 2022, 12:36 AM

Variable name changes

Harbormaster completed remote builds in B199952: Diff 478479. Nov 29 2022, 12:39 AM

Ran clang-format

Harbormaster completed remote builds in B199954: Diff 478481. Nov 29 2022, 12:46 AM

nilanjana_basu marked an inline comment as done. Nov 29 2022, 12:58 AM

nilanjana_basu added inline comments.

MicroBenchmarks/LoopVectorization/VectorOperations.cpp
67	Added a version of the benchmark with only vectorization enabled but no other pragmas. Changed the VF to test vectors of length 16 specifically, similar to the VF 8 case, assuming these two are commonly used VFs.
141	Done. Also added the correctness for each version of the benchmarks i.e. for different vectorization configurations. Each configuration should trigger different paths of the trunc/zext lowering & therefore, will be good to be tested for correctness.

fhahn added inline comments. Dec 1 2022, 9:36 AM

MicroBenchmarks/LoopVectorization/VectorOperations.cpp
79	Is this effectively the same code as `benchForTruncOrZextVecInLoopWithVW8` and just calling a different benchmark function? If so, better to make the function an argument to avoid duplication?

Removed duplicate code by adding function pointers as parameter as advised in the reviews. Added more performance tests using ZExt/Trunc operations in combination with addition operation.
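
The committed version of this refactoring is not shown on this page (the diff below is the earlier Diff 478479). As a hedged sketch of the idea, a single driver can take the conversion kernel as a function pointer so that each vectorization configuration reuses the same setup and correctness check. `KernelFn`, `convertNoVec`, `convertVW16` and `kernelMatchesReference` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Signature shared by every conversion kernel in this sketch.
template <typename SrcTy, typename DstTy>
using KernelFn = void (*)(const SrcTy *, DstTy *, std::size_t);

// Reference kernel with vectorization disabled.
template <typename SrcTy, typename DstTy>
void convertNoVec(const SrcTy *A, DstTy *B, std::size_t N) {
#if defined(__clang__)
#pragma clang loop vectorize(disable)
#endif
  for (std::size_t I = 0; I < N; I++)
    B[I] = static_cast<DstTy>(A[I]);
}

// One of several kernels that could be passed to the shared driver.
template <typename SrcTy, typename DstTy>
void convertVW16(const SrcTy *A, DstTy *B, std::size_t N) {
#if defined(__clang__)
#pragma clang loop vectorize_width(16) interleave_count(4)
#endif
  for (std::size_t I = 0; I < N; I++)
    B[I] = static_cast<DstTy>(A[I]);
}

// Shared driver: runs the given kernel and compares it with the reference.
template <typename SrcTy, typename DstTy>
bool kernelMatchesReference(KernelFn<SrcTy, DstTy> Kernel,
                            const std::vector<SrcTy> &Input) {
  std::vector<DstTy> Got(Input.size()), Want(Input.size());
  convertNoVec(Input.data(), Want.data(), Input.size());
  Kernel(Input.data(), Got.data(), Input.size());
  return Got == Want;
}
```

A benchmark wrapper built the same way would call the driver once before the timing loop and then invoke `Kernel` inside it, so adding a new configuration only requires a new kernel function.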

Harbormaster completed remote builds in B200592: Diff 479382. Dec 1 2022, 12:21 PM

nilanjana_basu marked an inline comment as done. Dec 1 2022, 12:21 PM

LGTM, thanks!

MicroBenchmarks/LoopVectorization/VectorOperations.cpp
148	it might help with readability if there was a newline after each `BENCHMARK...`

This revision is now accepted and ready to land. Dec 1 2022, 1:51 PM

nilanjana_basu retitled this revision from [MicroBenchmarks,AArch64] Added correctness test for truncate or zero-extend vector operations to [MicroBenchmarks,AArch64] Added correctness test & other performance tests for truncate or zero-extend vector operations. Dec 1 2022, 9:49 PM

nilanjana_basu edited the summary of this revision.

Closed by commit rT08de51078b0a: [MicroBenchmarks,AArch64] Added correctness test & other performance tests for… (authored by nilanjana_basu). Dec 1 2022, 10:09 PM

This revision was automatically updated to reflect the committed changes.

nilanjana_basu added a commit: rT08de51078b0a: [MicroBenchmarks,AArch64] Added correctness test & other performance tests for….

nilanjana_basu mentioned this in rG955c0f13cd70: [AArch64] Extending lowering of 'zext <Y x i8> %x to <Y x i8X>' to use tbl…. Dec 9 2022, 12:51 AM

Diff 478479

MicroBenchmarks/LoopVectorization/VectorOperations.cpp

				// This program tests vectorized truncates & zero-extends for performance and
				// correctness
	#include <iostream>			#include <iostream>
	#include <memory>			#include <memory>
	#include <random>			#include <random>

	#include "benchmark/benchmark.h"			#include "benchmark/benchmark.h"

	#define ITERATIONS 10000			#define ITERATIONS 10000

	static std::mt19937 rng;			static std::mt19937 rng;

	// Initialize array A with random numbers.			// Initialize array A with random numbers.
	template <typename Ty>			template <typename Ty>
	static void init_data(const std::unique_ptr<Ty[]> &A, unsigned N) {			static void init_data(const std::unique_ptr<Ty[]> &A, unsigned N) {
	std::uniform_int_distribution<uint64_t> distrib(			std::uniform_int_distribution<Ty> distrib(std::numeric_limits<Ty>::min(),
	std::numeric_limits<Ty>::min(), std::numeric_limits<Ty>::max());			std::numeric_limits<Ty>::max());
	for (unsigned i = 0; i < N; i++)			for (unsigned I = 0; I < N; I++)
	A[i] = static_cast<Ty>(distrib(rng));			A[I] = distrib(rng);
				}

				// Truncate/Zero-extend elements to create expected results with no
				nilanjana_basu: Added noinline for ease of verifying the IR to check if this part is not optimized out. Can be removed if deemed unnecessary.
				fhahn: it might confuse readers, so it's better to remove it I think
				// vectorization
				fhahn: LLVM coding style uses uppercase for variables
				template <typename Ty1, typename Ty2> static void
				truncOrZextWithNoVec(const Ty1 A, Ty2 B, int Iterations) {
				#pragma clang loop vectorize(disable)
				for (unsigned I = 0; I < Iterations; I++) {
				B[I] = A[I];
				}
	}			}

	// Truncate/Zero-extend each vector element in a vectorized loop with vectorization width 8			// Truncate/Zero-extend each vector element in a vectorized loop with vectorization width 8
	template <typename Ty1, typename Ty2> static void truncOrZextVecInLoopWithVW8(const Ty1 A, Ty2 B, int iterations) {			template <typename Ty1, typename Ty2> static void truncOrZextVecInLoopWithVW8(const Ty1 A, Ty2 B, int Iterations) {
	#pragma clang loop vectorize_width(8) interleave_count(4)			#pragma clang loop vectorize_width(8) interleave_count(4)
	for (unsigned i = 0; i < iterations; i++) {			for (unsigned I = 0; I < Iterations; I++) {
	B[i] = A[i];			B[I] = A[I];
	}			}
	}			}

	template <typename Ty1, typename Ty2> static void __attribute__((always_inline))			template <typename Ty1, typename Ty2> static void __attribute__((always_inline))
	benchForTruncOrZextVecInLoopWithVW8(benchmark::State &state) {			benchForTruncOrZextVecInLoopWithVW8(benchmark::State &state) {
	std::unique_ptr<Ty1[]> A(new Ty1[ITERATIONS]);			std::unique_ptr<Ty1[]> A(new Ty1[ITERATIONS]);
	std::unique_ptr<Ty2[]> B(new Ty2[ITERATIONS]);			std::unique_ptr<Ty2[]> B(new Ty2[ITERATIONS]);
				std::unique_ptr<Ty2[]> C(new Ty2[ITERATIONS]);

	init_data(A, ITERATIONS);			init_data(A, ITERATIONS);
	init_data(B, ITERATIONS);
				// Check for correctness
				truncOrZextWithNoVec(&A[0], &C[0], ITERATIONS);
				truncOrZextVecInLoopWithVW8(&A[0], &B[0], ITERATIONS);
				for (int I = 0; I < ITERATIONS; I++) {
				if (B[I] != C[I]) {
				std::cerr << "ERROR: Trunc or ZExt operation on " << A[I]
				<< " is showing result " << B[I] << " instead of " << C[I]
				<< "\n";
				exit(1);
				}
				}

	for (auto _ : state) {			for (auto _ : state) {
	benchmark::DoNotOptimize(B);			benchmark::DoNotOptimize(B);
	benchmark::ClobberMemory();			benchmark::ClobberMemory();
	truncOrZextVecInLoopWithVW8(&A[0], &B[0], ITERATIONS);			truncOrZextVecInLoopWithVW8(&A[0], &B[0], ITERATIONS);
	}			}
	}			}

	// Truncate/Zero-extend each vector element in a vectorized loop			// Truncate/Zero-extend each vector element in a vectorized loop with vector width 16
				fhahn: it would also be good to benchmark a version of the loop without any pragmas. Also, why fix the VF?
				nilanjana_basu: Added a version of the benchmark with only vectorization enabled but no other pragmas. Changed the VF to test vectors of length 16 specifically, similar to the VF 8 case, assuming these two are commonly used VFs.
	template <typename Ty1, typename Ty2> static void truncOrZextVecInLoop(const Ty1 A, Ty2 B, int iterations) {			template <typename Ty1, typename Ty2>
	#pragma clang loop interleave_count(4)			static void truncOrZextVecInLoopWithVW16(const Ty1 A, Ty2 B, int Iterations) {
	for (unsigned i = 0; i < iterations; i++) {			#pragma clang loop vectorize_width(16) interleave_count(4)
	B[i] = A[i];			for (unsigned I = 0; I < Iterations; I++) {
				B[I] = A[I];
	}			}
	}			}

	template <typename Ty1, typename Ty2> static void __attribute__((always_inline))			template <typename Ty1, typename Ty2>
				static void __attribute__((always_inline))
				benchForTruncOrZextVecInLoopWithVW16(benchmark::State &state) {
				std::unique_ptr<Ty1[]> A(new Ty1[ITERATIONS]);
				fhahn: Is this effectively the same code as `benchForTruncOrZextVecInLoopWithVW8` and just calling a different benchmark function? If so, better to make the function an argument to avoid duplication?
				std::unique_ptr<Ty2[]> B(new Ty2[ITERATIONS]);
				std::unique_ptr<Ty2[]> C(new Ty2[ITERATIONS]);

				init_data(A, ITERATIONS);

				// Check for correctness
				truncOrZextWithNoVec(&A[0], &C[0], ITERATIONS);
				truncOrZextVecInLoopWithVW16(&A[0], &B[0], ITERATIONS);
				for (int I = 0; I < ITERATIONS; I++) {
				if (B[I] != C[I]) {
				std::cerr << "ERROR: Trunc or ZExt operation on " << A[I]
				<< " is showing result " << B[I] << " instead of " << C[I]
				<< "\n";
				exit(1);
				}
				}

				for (auto _ : state) {
				benchmark::DoNotOptimize(B);
				benchmark::ClobberMemory();
				truncOrZextVecInLoopWithVW16(&A[0], &B[0], ITERATIONS);
				}
				}

				// Truncate/Zero-extend each vector element in a vectorized loop with vector width 16
				template <typename Ty1, typename Ty2>
				static void truncOrZextVecInLoop(const Ty1 A, Ty2 B, int Iterations) {
				#pragma clang loop vectorize(enable)
				for (unsigned I = 0; I < Iterations; I++) {
				B[I] = A[I];
				}
				}

				template <typename Ty1, typename Ty2>
				static void __attribute__((always_inline))
	benchForTruncOrZextVecInLoop(benchmark::State &state) {			benchForTruncOrZextVecInLoop(benchmark::State &state) {
	std::unique_ptr<Ty1[]> A(new Ty1[ITERATIONS]);			std::unique_ptr<Ty1[]> A(new Ty1[ITERATIONS]);
	std::unique_ptr<Ty2[]> B(new Ty2[ITERATIONS]);			std::unique_ptr<Ty2[]> B(new Ty2[ITERATIONS]);
				std::unique_ptr<Ty2[]> C(new Ty2[ITERATIONS]);

	init_data(A, ITERATIONS);			init_data(A, ITERATIONS);
	init_data(B, ITERATIONS);
				// Check for correctness
				truncOrZextWithNoVec(&A[0], &C[0], ITERATIONS);
				truncOrZextVecInLoop(&A[0], &B[0], ITERATIONS);
				for (int I = 0; I < ITERATIONS; I++) {
				if (B[I] != C[I]) {
				std::cerr << "ERROR: Trunc or ZExt operation on " << A[I]
				<< " is showing result " << B[I] << " instead of " << C[I]
				<< "\n";
				exit(1);
				}
				}

	for (auto _ : state) {			for (auto _ : state) {
	benchmark::DoNotOptimize(B);			benchmark::DoNotOptimize(B);
	benchmark::ClobberMemory();			benchmark::ClobberMemory();
	truncOrZextVecInLoop(&A[0], &B[0], ITERATIONS);			truncOrZextVecInLoop(&A[0], &B[0], ITERATIONS);
	}			}
	}			}

	// Add vectorized truncate or zero-extend operation benchmarks for different element types			// Add vectorized truncate or zero-extend operation benchmarks for different element types
				fhahn: It might be better to move this before the main loop so we fail early.
				nilanjana_basu: Done. Also added the correctness for each version of the benchmarks i.e. for different vectorization configurations. Each configuration should trigger different paths of the trunc/zext lowering & therefore, will be good to be tested for correctness.
	#define ADD_BENCHMARK(ty1, ty2) \			#define ADD_BENCHMARK(ty1, ty2) \
	void benchForTruncOrZextVecInLoopWithVW8From_##ty1##_To_##ty2##_(benchmark::State &state) { \			void benchForTruncOrZextVecInLoopWithVW8From_##ty1##_To_##ty2##_( \
				benchmark::State &state) { \
	benchForTruncOrZextVecInLoopWithVW8<ty1, ty2>(state); \			benchForTruncOrZextVecInLoopWithVW8<ty1, ty2>(state); \
	} \			} \
	BENCHMARK(benchForTruncOrZextVecInLoopWithVW8From_##ty1##_To_##ty2##_); \			BENCHMARK(benchForTruncOrZextVecInLoopWithVW8From_##ty1##_To_##ty2##_); \
	void benchForTruncOrZextVecInLoopFrom_##ty1##_To_##ty2##_(benchmark::State &state) { \			void benchForTruncOrZextVecInLoopWithVW16From_##ty1##_To_##ty2##_( \
				fhahn: it might help with readability if there was a newline after each `BENCHMARK...`
				benchmark::State &state) { \
				benchForTruncOrZextVecInLoopWithVW16<ty1, ty2>(state); \
				} \
				BENCHMARK(benchForTruncOrZextVecInLoopWithVW16From_##ty1##_To_##ty2##_); \
				void benchForTruncOrZextVecInLoopFrom_##ty1##_To_##ty2##_( \
				benchmark::State &state) { \
	benchForTruncOrZextVecInLoop<ty1, ty2>(state); \			benchForTruncOrZextVecInLoop<ty1, ty2>(state); \
	} \			} \
	BENCHMARK(benchForTruncOrZextVecInLoopFrom_##ty1##_To_##ty2##_); \			BENCHMARK(benchForTruncOrZextVecInLoopFrom_##ty1##_To_##ty2##_);

	/* Vectorized truncate operations */			/* Vectorized truncate operations */
	ADD_BENCHMARK(uint64_t, uint8_t)
	ADD_BENCHMARK(uint32_t, uint8_t)
	ADD_BENCHMARK(uint16_t, uint8_t)			ADD_BENCHMARK(uint16_t, uint8_t)
				fhahn: we also need versions with different dst types than i8. Same for the src types for zexts
				ADD_BENCHMARK(uint32_t, uint8_t)
				ADD_BENCHMARK(uint64_t, uint8_t)
				ADD_BENCHMARK(uint32_t, uint16_t)
				ADD_BENCHMARK(uint64_t, uint16_t)
				ADD_BENCHMARK(uint64_t, uint32_t)

	/* Vectorized zero extend operations */			/* Vectorized zero extend operations */
				ADD_BENCHMARK(uint8_t, uint16_t)
	ADD_BENCHMARK(uint8_t, uint32_t)			ADD_BENCHMARK(uint8_t, uint32_t)
				ADD_BENCHMARK(uint8_t, uint64_t)
				ADD_BENCHMARK(uint16_t, uint32_t)
				ADD_BENCHMARK(uint16_t, uint64_t)
				ADD_BENCHMARK(uint32_t, uint64_t)

This is an archive of the discontinued LLVM Phabricator instance.

[MicroBenchmarks,AArch64] Added correctness test & other performance tests for truncate or zero-extend vector operations
ClosedPublic
