This is an archive of the discontinued LLVM Phabricator instance.

[test-suite] Add regression test for indirect branch critical edge splitting
Closed, Public

Authored by mkuper on Feb 23 2017, 3:18 PM.

Details

Summary

This is a regression test benchmark for D29916.

Eli, does this seem reasonable?
For reference, without D29916 this takes ~12 seconds on my machine, and with D29916, about 3.5 seconds.
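For context (an editor's sketch, not the committed test source): a benchmark for indirect branch critical edge splitting typically hammers a computed-goto dispatch loop, where a single indirect branch has many predecessors and successors. The label names, structure, and iteration count below are illustrative assumptions.

```c++
// Minimal sketch of an interpreter-style loop dominated by indirect
// branches (GNU computed-goto extension; builds with GCC/Clang).
// Every "goto *..." is an indirect branch whose critical edges the
// change in D29916 is concerned with.
#include <cstdio>

static int run(long iterations) {
  static void *dispatch[] = {&&op_add, &&op_sub, &&op_loop};
  long acc = 0, i = 0;
  goto *dispatch[2];
op_add:
  acc += 3;
  goto *dispatch[1];
op_sub:
  acc -= 1;
  goto *dispatch[2];
op_loop:
  if (++i < iterations)
    goto *dispatch[i & 1]; // hot indirect branch with multiple targets
  return (int)acc;
}

int main() {
  printf("%d\n", run(1000000)); // iteration count discussed later in this thread
  return 0;
}
```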

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper created this revision. Feb 23 2017, 3:18 PM

I'd like to add: thanks for doing this! We should definitely encourage the adding of performance tests like this.

You should be thanking Eli, not me, he's pretty much forcing me to do this. :-)

Okay. Thanks, Eli! :-)

efriedma edited edge metadata. Feb 23 2017, 3:38 PM

:)

The loop looks fine. Someone else should check that the build system etc. changes are correct.

hfinkel accepted this revision. Feb 23 2017, 3:44 PM

LGTM

This revision is now accepted and ready to land. Feb 23 2017, 3:44 PM
This revision was automatically updated to reflect the committed changes.
MatzeB added a subscriber: MatzeB. Mar 2 2017, 10:28 AM

Sorry to be this guy: This benchmark is running for too long! We should aim for 0.5-1s runtimes for our benchmarks, and the 1000000 iteration count looks arbitrary to me. (This takes nearly 3x the time of salsa20, the next-slowest benchmark in SingleSource/Benchmarks/Misc, for me.)

Just lowering the iteration count is the way to go. Aiming for a specific wall time is counterproductive, at least today, as we also have modes where we look at profile data and performance counters and want to compare them between runs.

(Long term, we should have something like Google Benchmark for our microbenchmarking here, which would run the function just often enough to get stable timing results. Maybe by tweaking it to run a fixed number of times for the cases with an external profiling tool.)
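(Editor's sketch of what that could look like, assuming Google Benchmark's C++ API; BM_EvalLoop and the placeholder kernel are hypothetical names, not anything from this review:)

```c++
#include <benchmark/benchmark.h>

// Hypothetical stand-in for the kernel under test.
static int eval_loop(long n) {
  long acc = 0;
  for (long i = 0; i < n; ++i)
    acc += i;
  return (int)acc;
}

static void BM_EvalLoop(benchmark::State &state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(eval_loop(1000));
}

// Default registration: the library repeats the loop just often enough
// to get stable timing results.
BENCHMARK(BM_EvalLoop);

// For runs under an external profiling tool, the Iterations() override
// pins a fixed count so counters stay comparable between runs:
// BENCHMARK(BM_EvalLoop)->Iterations(1000);

BENCHMARK_MAIN();
```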

For the record: This was in response to Michael's comment on llvm-commits, which Phabricator ignored...

FWIW, I did an experiment a while back on a few AArch64 and X86 machines to see what the minimum running time for programs in the test-suite should be so that they aren't noisy simply because they run too briefly.
My experiments show that, across the machines I tested on, as soon as a program runs for longer than 0.01 seconds, there is no noise attributable to the shortness of its run time. This is when using "lnt runtest nt --use-perf=1" on Linux.
So, in my experience, aiming for a 0.1s runtime still leaves an order-of-magnitude safety margin; that may be a good execution time to target.

My back-of-the-envelope calculation from a bit more than a year ago is that if we could make all programs in the test-suite run for about 0.1s, the test-suite would execute about 200 times faster than it does today, and probably produce results of the same quality. See slide 26 in http://llvm.org/devmtg/2015-10/slides/Beyls-AutomatedPerformanceTrackingOfLlvmGeneratedCode.pdf. In other words, a single run on a Cortex-A53 would take about 30s instead of almost 2 hours, and it would become feasible to do full multi-run test-suite runs for every commit.