This is an archive of the discontinued LLVM Phabricator instance.

Tune TTI getMaxInterleaveFactor for POWER8
Needs Review · Public

Authored by ohsallen on Feb 9 2015, 8:11 AM.

Details

Reviewers
hfinkel
Summary

For the P8, VSU instructions have a 6-cycle latency and there are two VSU units, so unroll by 12x for latency hiding.
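
As context for the numbers above, here is a minimal sketch of the kind of change being proposed, assuming the usual shape of the PowerPC TTI hook and subtarget-directive query (names such as PPCTTIImpl, getCPUDirective, and the VF parameter follow later in-tree spellings and are not taken from this diff):

  // Sketch only, not the patch itself: tune the interleave factor for the P8.
  unsigned PPCTTIImpl::getMaxInterleaveFactor(unsigned VF) {
    unsigned Directive = ST->getCPUDirective();

    // P8: VSU instructions have a 6-cycle latency and there are two VSU
    // units, so interleave by 2 * 6 = 12 to keep both units busy.
    if (Directive == PPC::DIR_PWR8)
      return 12;

    // For most things, modern systems have two execution units (and
    // out-of-order execution).
    return 2;
  }

The general recipe is interleave factor = number of units * instruction latency, i.e. enough independent copies of the loop body to keep every pipeline slot occupied.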

Diff Detail

Event Timeline

ohsallen updated this revision to Diff 19581. Feb 9 2015, 8:11 AM
ohsallen retitled this revision from to Tune TTI getMaxInterleaveFactor for POWER8.
ohsallen updated this object.
ohsallen edited the test plan for this revision. (Show Details)
ohsallen added a reviewer: hfinkel.
ohsallen added a subscriber: Unknown Object (MLST). Feb 9 2015, 11:50 AM
hfinkel edited edge metadata. Feb 10 2015, 6:00 AM

Thanks for working on this, but I don't quite understand the logic (stacking the latency of the two pipelines seems odd to me). How did you tune this?

Would the same logic apply to the P7?

Regardless, we'll need a test case (it would go in test/Transforms/LoopVectorize/PowerPC).

Hal,

> Thanks for working on this, but I don't quite understand the logic (stacking the latency of the two pipelines seems odd to me). How did you tune this?

I based this on the comment above the default case: to me, it seems that we can have 12 FP operations in the pipeline. Did you expect that number to be 6?

  // For most things, modern systems have two execution units (and
  // out-of-order execution).
  return 2;

> Would the same logic apply to the P7?

You are right (if the logic makes sense!).

> Hal,
>
> > Thanks for working on this, but I don't quite understand the logic (stacking the latency of the two pipelines seems odd to me). How did you tune this?
>
> I based this on the comment above the default case: to me, it seems that we can have 12 FP operations in the pipeline. Did you expect that number to be 6?
>
>   // For most things, modern systems have two execution units (and
>   // out-of-order execution).
>   return 2;

Ah, okay. The logic behind the comment was to create a reasonable default. The idea is that you interleave (which, to be clear, is what is often called modulo unrolling) by 2 to fill both functional units, under the assumption that the out-of-order dispatching would take care of hiding instruction latency. Obviously, when you know something about the latency, you can do better.

And so you're right: if we follow that logic, then 12x would be correct. Of course, except for very simple loops, we can't unroll that much because of register pressure (and I'm not entirely sure how accurate the IR-level register use estimator will be in this regard). It is also too much for integer instructions (which I imagine have lower latency?), although maybe not for vector integer ops?

In short, I'm slightly worried about setting such a large number without supporting measurements, because by the time that instruction scheduling, register allocation, and the core's out-of-order dispatching and dispatch-group formation get involved, it might not be optimal.
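
A minimal, self-contained illustration of the latency-hiding argument above, not taken from the review: with one accumulator a floating-point reduction is serialized on the add latency, while interleaving it over several independent accumulators lets the out-of-order core overlap those latencies. The factor of 4 below is arbitrary; the patch argues for units * latency = 2 * 6 = 12 on the P7/P8.

  #include <cstddef>

  // Illustration only: "interleaving" (modulo unrolling) a reduction so that
  // several independent dependence chains are in flight at once.
  double sum_interleaved(const double *A, std::size_t N) {
    double S0 = 0.0, S1 = 0.0, S2 = 0.0, S3 = 0.0;
    std::size_t I = 0;
    for (; I + 4 <= N; I += 4) {
      S0 += A[I + 0]; // four independent chains: their add latencies
      S1 += A[I + 1]; // overlap instead of serializing
      S2 += A[I + 2];
      S3 += A[I + 3];
    }
    for (; I < N; ++I) // scalar epilogue for the remainder
      S0 += A[I];
    return S0 + S1 + S2 + S3;
  }

Note that this reassociates the reduction, which the vectorizer will only do when the fast-math flags on the IR allow it; that caveat is separate from the register-pressure concern above.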

> Would the same logic apply to the P7?
>
> You are right (if the logic makes sense!).

I ran benchmarks on the P7 today, and I'm fine with this change. Setting this value to 12 gives the following speedups:

  MultiSource/Applications/JM/ldecod/ldecod: -49.5771% +/- 23.3244%
  MultiSource/Applications/JM/lencod/lencod: -52.9663% +/- 31.49%

(and some improvement in SingleSource/Benchmarks/Adobe-C++/loop_unroll), and no significant regressions.

Please make the same change for the P7 and the P8, and add a test case in test/Transforms/LoopVectorize/PowerPC.

> I ran benchmarks on the P7 today, and I'm fine with this change.

Thanks Hal for benchmarking this! Committed revision 228973.
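
For reference, relative to the sketch under the summary, the change the reviewer asked for (and which presumably landed as r228973) amounts to treating the P7 the same way; a hedged sketch of that condition:

  // Sketch of the final shape, not the literal committed diff: both the P7
  // and the P8 get the 12x interleave factor; everything else keeps the
  // default of 2.
  if (Directive == PPC::DIR_PWR7 || Directive == PPC::DIR_PWR8)
    return 12;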