
Tune TTI getMaxInterleaveFactor for POWER8
Needs ReviewPublic

Authored by ohsallen on Feb 9 2015, 8:11 AM.

Details

Reviewers
hfinkel
Summary

For the P8, VSU instructions have a 6-cycle latency and there are two VSU units, so unroll by 12x for latency hiding.

Diff Detail

Event Timeline

ohsallen updated this revision to Diff 19581.Feb 9 2015, 8:11 AM
ohsallen retitled this revision from to Tune TTI getMaxInterleaveFactor for POWER8.
ohsallen updated this object.
ohsallen edited the test plan for this revision. (Show Details)
ohsallen added a reviewer: hfinkel.
ohsallen added a subscriber: Unknown Object (MLST).Feb 9 2015, 11:50 AM
hfinkel edited edge metadata.Feb 10 2015, 6:00 AM

Thanks for working on this, but I don't quite understand the logic (stacking the latency of the two pipelines seems odd to me). How did you tune this?

Would the same logic apply to the P7?

Regardless, we'll need a test case (it would go in test/Transforms/LoopVectorize/PowerPC).

Hal,

> Thanks for working on this, but I don't quite understand the logic (stacking the latency of the two pipelines seems odd to me). How did you tune this?

I based this on the comment above the default case: to me, it seems that we can have 12 FP operations in the pipeline. Did you expect that number to be 6?

// For most things, modern systems have two execution units (and
// out-of-order execution).
return 2;

> Would the same logic apply to the P7?

You are right (if the logic makes sense!).

> Hal,
>
>> Thanks for working on this, but I don't quite understand the logic (stacking the latency of the two pipelines seems odd to me). How did you tune this?
>
> I based this on the comment above the default case: to me, it seems that we can have 12 FP operations in the pipeline. Did you expect that number to be 6?
>
> // For most things, modern systems have two execution units (and
> // out-of-order execution).
> return 2;

Ah, okay. The logic behind the comment was to create a reasonable default. The idea is that you interleave (which, to be clear, is what is often called modulo unrolling) by 2 to fill both functional units under the assumption that the ooo dispatching would take care of hiding instruction latency. Obviously, when you know something about the latency, you can do better.

And so you're right, if we follow that logic, then 12x would be right. Of course, except for very simple loops, we can't unroll that much because of register pressure (and I'm not entirely sure how accurate the IR-level register use estimator will be in this regard). It is also too much for integer instructions (which I imagine have lower latency?), although maybe not for vector integer ops?

In short, I'm slightly worried about setting such a large number without supporting measurements, because by the time that instruction scheduling, register allocation, and the core's ooo dispatching and dispatch group formation get involved, it might not be optimal.

>> Would the same logic apply to the P7?
>
> You are right (if the logic makes sense!).

I ran benchmarks on the P7 today, and I'm fine with this change. Setting this value to 12 gives the following speedups:

MultiSource/Applications/JM/ldecod/ldecod
-49.5771% +/- 23.3244%
MultiSource/Applications/JM/lencod/lencod
-52.9663% +/- 31.49%
(and some improvement in SingleSource/Benchmarks/Adobe-C++/loop_unroll), and no significant regressions.

Please make the same change for the P7 and the P8, and add a test case in test/Transforms/LoopVectorize/PowerPC.

> I ran benchmarks on the P7 today, and I'm fine with this change.

Thanks, Hal, for benchmarking this! Committed revision 228973.