This is an archive of the discontinued LLVM Phabricator instance.

Break dependencies in large loops containing reductions (LoopVectorize)
ClosedPublic

Authored by ohsallen on Feb 9 2015, 10:44 AM.

Details

Summary

For the code below, the loop vectorizer is expected to unroll the loop and break loop-carried dependencies by using more registers (see http://reviews.llvm.org/D7128).

for(int i=0; i<n; i++) {
  for(int i_c=0; i_c<3; i_c++) {
    _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
    sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
  }
}

Here, the inner loop is first unrolled by the regular loop unroller, which doesn't break dependencies. The loop vectorizer should then unroll the outer loop and break dependencies. But this doesn't happen, since the heuristics consider only small loops worth unrolling. This patch fixes the heuristics.

For the example above on POWER8, there is a 2.5x speedup. To handle all targets appropriately, I propose the following cost function to compute the unroll factor (not in this patch yet):

UF = UF * CriticalPathLength / LoopLength

UF is the unroll factor already computed up to this point, which takes register pressure into account and is bounded by the TTI max interleave factor. CriticalPathLength and LoopLength only count the interesting operations (FP?). The cost function reduces UF while ensuring that ILP opportunities are still met. It works well for the example above on P8. Please tell me what you think.
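As a concrete sketch, the adjustment might look like the following (the helper name and the integer-arithmetic details are illustrative assumptions, not actual LoopVectorize code):

```cpp
#include <algorithm>

// Illustrative sketch only, not actual LoopVectorize code.
// UF has already been bounded by register pressure and by the TTI
// max interleave factor; CriticalPathLength and LoopLength count
// only the interesting (e.g. FP) operations.
unsigned adjustUF(unsigned UF, unsigned CriticalPathLength,
                  unsigned LoopLength) {
  if (LoopLength == 0)
    return UF;
  // Scale UF by the fraction of the loop sitting on the critical
  // (reduction) path, but never go below a factor of 1.
  return std::max(1u, UF * CriticalPathLength / LoopLength);
}
```

For the example above, essentially the whole unrolled inner body is reduction work, so CriticalPathLength equals LoopLength and UF is left untouched; for a loop where the reduction is a small fraction of the body, UF shrinks accordingly.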

Thanks,
Olivier

Diff Detail

Event Timeline

ohsallen updated this revision to Diff 19597.Feb 9 2015, 10:44 AM
ohsallen retitled this revision from to Break dependencies in large loops containing reductions (LoopVectorize).
ohsallen updated this object.
ohsallen edited the test plan for this revision. (Show Details)
ohsallen added a reviewer: hfinkel.
ohsallen added a subscriber: Unknown Object (MLST).Feb 9 2015, 11:50 AM
hfinkel edited edge metadata.Feb 10 2015, 10:22 AM

I don't think that just ignoring SmallLoopCost for all loops with reductions will fly ;) -- but I think that adjusting the UF threshold in a more intelligent way certainly makes sense. On some cores it is important not to make loops too large, because there is a threshold penalty effect. For example, on Intel cores there is a small buffer associated with the LSD (the loop stream detector), and we want loops to fit into that buffer. I believe this is why SmallLoopCost is currently so small (20). That having been said, there is a trade-off, and we'll need to do some tuning.

To handle all targets appropriately, I propose the following cost function to compute the unroll factor (not in this patch yet):

UF = UF * CriticalPathLength / LoopLength

I agree, this seems like essentially what you'd like to do (assuming that the original UF is set based on the number of functional units available). We don't, however, want this to override the register pressure constraint. Once you start spilling in the loop, most of these considerations are more or less irrelevant.

Adam, Nadav, any thoughts on this?

Procedural note: Please upload full-context patches, see: http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

I don't think that just ignoring SmallLoopCost for all loops with reductions will fly ;) -- but I think that adjusting the UF threshold in a more intelligent way certainly makes sense.

Right, typically, we don't want to unroll the code below. With the proposed cost function, assuming that CriticalPathLength is 1, UF is divided by LoopLength, which is large.

for (...) {
    // large instruction sequence unrelated to r
    r += ...
}

On some cores it is important not to make loops too large, because there is a threshold penalty effect. For example, on Intel cores there is a small buffer associated with the LSD (the loop stream detector), and we want loops to fit into that buffer. I believe this is why SmallLoopCost is currently so small (20). That having been said, there is a trade-off, and we'll need to do some tuning.

We could compute a threshold on UnrolledLoopSize, which would be related to the LSD on Intel and to the I-cache on other targets. Then we would reduce UF until UnrolledLoopSize fits within that threshold.
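As a sketch (the names here are illustrative, not an existing LLVM API), that clamping could be:

```cpp
// Illustrative sketch: shrink UF until the estimated unrolled body
// size fits a target-specific budget (the LSD buffer on Intel,
// something derived from the I-cache on other targets).
unsigned clampToSizeBudget(unsigned UF, unsigned LoopLength,
                           unsigned SizeThreshold) {
  while (UF > 1 && UF * LoopLength > SizeThreshold)
    --UF;
  return UF;
}
```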

UF = UF * CriticalPathLength / LoopLength

I agree, this seems like essentially what you'd like to do (assuming that the original UF is set based on the number of functional units available). We don't, however, want this to override the register pressure constraint. Once you start spilling in the loop, most of these considerations are more or less irrelevant.

UF on the right-hand side of the assignment is already computed according to register pressure (using calculateRegisterUsage()), so I assume it takes into account the fact that the loop is large. The resulting UF would usually be smaller, so I believe this would be conservative enough?

ohsallen updated this revision to Diff 19696.Feb 10 2015, 10:59 AM
ohsallen edited edge metadata.

Full-context patch

Maybe I'm not understanding exactly what you're proposing. Are you going to calculate the critical path length in units of instructions, or using the throughput costs, or using some latency measure?

Maybe I'm not understanding exactly what you're proposing. Are you going to calculate the critical path length in units of instructions, or using the throughput costs, or using some latency measure?

CriticalPathLength and LoopLength are both in units of instructions.

Hi Olivier,

I'm probably missing something, but what dependencies would unrolling of the outer loop break?

for(int i=0; i<n; i++) {
  for(int i_c=0; i_c<3; i_c++) {
    _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
    sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
  }
}

I see that it can be beneficial, but it's not about dependencies, as far as I can tell (e.g., SLP can probably vectorize the unrolled body).

Hi Michael,

I'm probably missing something, but what dependencies would unrolling of the outer loop break?

It would break the dependencies between the reduction operations. With the example below, we can dispatch twice as many instructions (if the target permits), which is profitable because it exposes more ILP.

// Original loop.
for (int i = 0; i < n; i++) 
    for (int j = 0; j < 3; j++)
        r += arr[i][j];

// After unrolling innermost loop.
for (int i = 0; i < n; i++)  {
    r += arr[i][0];
    r += arr[i][1];
    r += arr[i][2];
}

// After unrolling outermost loop (in vectorizer, which breaks dependencies).
for (int i = 0; i < n; i += 2)  {
    r += arr[i][0];
    r_0 += arr[i+1][0];
    r += arr[i][1];
    r_0 += arr[i+1][1];
    r += arr[i][2];
    r_0 += arr[i+1][2];
}
r += r_0;

Thanks for the explanation!

We should do some experimentation on other architectures. An OoO core might be able to hide latency by dispatching loop iterations before the previous ones retire; however, I'm not sure what limits exist on this in general, or how we should model them. Do the register dependencies from the reduction kill this in general, or only on certain targets? Even when it is possible, it might be limited by register-renaming resources: if there are many reduction steps in the loop body, and each step requires a rename, then our ability to speculate is limited significantly by that. And there are in-order targets that can't speculate into future loop iterations at all. Thoughts?

Let me try to explain the rationale behind the proposed cost function: UF = UF * CriticalPathLength / LoopLength

I assume CriticalPathLength is the number of reductions in the loop (or more precisely, the length of the longest reduction chain found in the loop, which would be 3 for the examples above once the innermost loop is unrolled).

(CriticalPathLength / LoopLength) reflects the distance between the reductions. Typically, the distance is very short for the examples above, so CriticalPathLength / LoopLength = 3 / 3 = 1, and UF remains unchanged. If the distance is long, there are potentially many FP instructions between each reduction, so an OoO pipeline will already be able to exploit ILP and we don't need to add much interleaving: UF will be reduced. And if there is no OoO, UF will be small anyway.
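With illustrative numbers (a hypothetical helper implementing the formula above), the two regimes look like this:

```cpp
#include <algorithm>

// Hypothetical helper for illustration; implements
// UF = UF * CriticalPathLength / LoopLength.
unsigned scaledUF(unsigned UF, unsigned CriticalPathLength,
                  unsigned LoopLength) {
  return std::max(1u, UF * CriticalPathLength / LoopLength);
}
```

For the dense example above (3 reductions out of 3 relevant instructions), scaledUF(4, 3, 3) leaves UF at 4; for a sparse loop with one reduction in a 20-instruction body, scaledUF(4, 1, 20) collapses UF to 1.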

Okay, this sounds reasonable, please provide a patch and we'll benchmark it.

And if there is no OoO, UF will be small anyway.

No, if there is no OoO, then the UF should be set to hide pipeline latency, and so its size will depend on the depth of the pipeline (which is not necessarily small).

For OoO cores with deep pipelines, we should also be unrolling larger loops, but that's another matter.

Hal,

Okay, this sounds reasonable, please provide a patch and we'll benchmark it.

I am worried that the impact will be small because, as I said, UF is bounded by TTI.getMaxInterleaveFactor(). Only for PPC, AMDGPU, AArch64 (for Cortex-A57), and X86 (with AVX) is that bound greater than 2 (and no greater than 4, except for AMDGPU). To get the optimal UF for the example above on P8 using the proposed cost function, TTI.getMaxInterleaveFactor() would have to return 12 instead of 2 (see http://reviews.llvm.org/D7503). And as you said, this is a theoretical max, but it's too big and might do harm because of register pressure.

Maybe we would need another TTI function that provides the theoretical max, like TTI::getMaxTheoreticalInterleaveFactor. Or we would have to multiply UF by some factor, which could be provided through a TTI function with default value 1?

Right, that's good. We need to make sure there aren't significant regressions in performance (or code-size increases that don't correspond to performance improvements).

Maybe we would need another TTI function that provides the theoretical max, like TTI::getMaxTheoreticalInterleaveFactor. Or we would have to multiply UF by some factor, which could be provided through a TTI function with default value 1?

I think we might want to separate the current single number into two numbers: one for ILP and one for latency. But I'm not exactly sure what you're suggesting.

I think we might want to separate the current single number into two numbers: one for ILP and one for latency. But I'm not exactly sure what you're suggesting.

It seems we don't really care about latency alone; we just want to know about ILP. As I see it, the current number takes into account ILP (that is, latency and the number of functional units) and, to some extent, register pressure: we don't want to put a big number like 12 in there.

What I need is a number that only takes ILP into account, because the cost function I propose already takes register pressure into account. Typically I want it to be 12 for P7/P8. What I am suggesting is to add a new TTI function that returns the max ILP available.

I think we might want to separate the current single number into two numbers: one for ILP and one for latency. But I'm not exactly sure what you're suggesting.

It seems we don't really care about latency alone; we just want to know about ILP. As I see it, the current number takes into account ILP (that is, latency and the number of functional units) and, to some extent, register pressure: we don't want to put a big number like 12 in there.

There is a separate register-pressure heuristic, which already uses a different TTI interface to get the number of available registers. Look at the calculateRegisterUsage() function.

What I need is a number that only takes ILP into account, because the cost function I propose already takes register pressure into account. Typically I want it to be 12 for P7/P8. What I am suggesting is to add a new TTI function that returns the max ILP available.

Maybe, but even if we only care about latency combined with ILP, we sometimes care only about ILP independent of latency. Thus, I think having interfaces that return (average) instruction latency, (average) ILP, and whether or not the processor can speculate future loop iterations without dependencies (true for OoO, false for in-order, generically), etc., is the most straightforward approach from the backend modeling perspective.

There is a separate register-pressure heuristic, which already uses a different TTI interface to get the number of available registers. Look at the calculateRegisterUsage() function.

Right, but then why not set 12 as the max interleave factor for POWER7/POWER8? From our previous discussion (http://reviews.llvm.org/D7503), I understood that you didn't want to put 12 because of potential spilling. I feel like register pressure and available ILP should be completely separate concerns.

Maybe, but even if we only care about latency combined with ILP, we sometimes care only about ILP independent of latency. Thus, I think having interfaces that return (average) instruction latency, (average) ILP, and whether or not the processor can speculate future loop iterations without dependencies (true for OoO, false for in-order, generically), etc., is the most straightforward approach from the backend modeling perspective.

This sounds great; I would be happy to implement that. When you talk about ILP, you're talking about the number of functional units, am I right? I will try to put something together without changing the current policy too much.

Thanks

There is a separate register-pressure heuristic, which already uses a different TTI interface to get the number of available registers. Look at the calculateRegisterUsage() function.

Right, but then why not set 12 as the max interleave factor for POWER7/POWER8? From our previous discussion (http://reviews.llvm.org/D7503), I understood that you didn't want to put 12 because of potential spilling. I feel like register pressure and available ILP should be completely separate concerns.

No, I simply didn't want to put 12 without benchmarking it first, because if you put 12, you're really depending on that heuristic to do a good job (because the number of allocatable registers is only a small multiple of that).

Maybe, but even if we only care about latency combined with ILP, we sometimes care only about ILP independent of latency. Thus, I think having interfaces that return (average) instruction latency, (average) ILP, and whether or not the processor can speculate future loop iterations without dependencies (true for OoO, false for in-order, generically), etc., is the most straightforward approach from the backend modeling perspective.

This sounds great; I would be happy to implement that. When you talk about ILP, you're talking about the number of functional units, am I right?

Yes.

I will try to put something together without changing the current policy too much.

Thanks

Sounds good, thanks!

ohsallen updated this revision to Diff 20312.Feb 19 2015, 10:11 AM

This patch unrolls large loops containing reductions. The cost function discussed here was implemented. Added enableAggressiveFMAFusion to TTI to fine-tune the heuristics.

nadav edited edge metadata.Feb 19 2015, 10:23 AM

Olivier, I don't really understand this heuristic. It looks like you are adding a lot of code to handle a very specific case, and I am not convinced that this heuristic is useful for the general case.

Following Hal’s suggestion to benchmark this patch is a really good idea. Do you know how it affects the LLVM test suite? Are you seeing any gains or regressions?

Alright, there are several separable changes here:

  1. The TTI change. I don't think this is the right way to solve the problem (and even if it were, I'd not make the backend see IR types; that's not necessary, just convert the IR types into backend types (look for TLI->getTypeLegalizationCost(Ty) in the default implementation of getArithmeticInstrCost, for example)).

    It seems like, in general, you want a way to measure the latency of some chain of instructions (other than just counting them). This is a general problem, and I recommend going after that issue as follow-up work.
  2. As noted below, I don't think you're counting the right thing (or at least, you don't seem to be counting what I'd expect). Can you please elaborate?
lib/Transforms/Vectorize/LoopVectorize.cpp
4592

I don't understand what this is doing. Don't you want to count the number of instructions needed to compute the value being 'added' to the reduction?

4626

This seems like an odd one-off to have here. FMAs are important, granted, but you don't even check if there are FMAs (or things likely to form FMAs) in the loop.

It seems like, in general, you want a way to measure the latency of some chain of instructions (other than just counting them). This is a general problem, and I recommend going after that issue as follow-up work.

Fair enough.

  2. As noted below, I don't think you're counting the right thing (or at least, you don't seem to be counting what I'd expect). Can you please elaborate?

Don't you want to count the number of instructions needed to compute the value being 'added' to the reduction?

Right, or more precisely the number of instructions between each reduction, because even if they are not computing the value being 'added', we need to take them into account when measuring the level of ILP exposed. To compute this number, we divide the number of relevant instructions (integer or floating-point) by the length of the critical path (that is, the longest reduction chain, which is computed via U = *I++).
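In other words (hypothetical names; the actual patch walks the reduction chain as described above):

```cpp
// Illustrative sketch of the counting described above: the average
// number of relevant (integer/FP) instructions per reduction step.
// Hypothetical names, not the patch's actual code.
unsigned avgDistanceBetweenReductions(unsigned NumRelevantInsts,
                                      unsigned CriticalPathLength) {
  if (CriticalPathLength == 0)
    return NumRelevantInsts;
  return NumRelevantInsts / CriticalPathLength;
}
```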

This seems like an odd one-off to have here. FMAs are important, granted, but you don't even check if there are FMAs (or things likely to form FMAs) in the loop.

I agree, I should be less sloppy and check that properly :-) Actually, I was wondering: in general, is it okay to emulate what the DAG combiner will do later on? I wasn't inclined to do that because it seems like code replication. Your suggestion above about measuring the latency of some chain of instructions is probably the right approach.

In D7514#126666, @nadav wrote:

Following Hal’s suggestion to benchmark this patch is a really good idea. Do you know how it affects the LLVM test suite? Are you seeing any gains or regressions?

I'll benchmark some variants of this heuristic and come back to you.

Thanks.

I benchmarked this patch (without the multiply-add nonsense) on POWER8 and got the following speedups:

MultiSource/Benchmarks/McCat/08-main/main: -59.4169% +/- 40.2649%
MultiSource/Benchmarks/Prolangs-C/fixoutput/fixoutput: -83.6948% +/- 82.1021%
SingleSource/UnitTests/2003-05-02-DependentPHI: -33.8917% +/- 31.3964%
SingleSource/UnitTests/2003-07-06-IntOverflow: -43.784% +/- 38.5492%

And the following slowdowns:

MultiSource/Applications/kimwitu++/kc: 80.6835% +/- 69.7813%
MultiSource/Applications/viterbi/viterbi: 35.2072% +/- 24.8361%
MultiSource/Benchmarks/7zip/7zip-benchmark: 9.8082% +/- 6.31851%
MultiSource/Benchmarks/nbench/nbench: 8.43677% +/- 7.81566%

I will have to investigate whether the slowdowns are related to spills. However, it seems that if we were able to fine-tune this, it would be profitable.

Neat; I see you're using my performance benchmarking script. :) -- I forgot to warn you, however (my fault), that I don't trust any of the numbers which have standard deviations more than about half to two-thirds of the delta. It looks like you need more samples (or quieter runs -- likely achievable by running fewer in parallel -- but more samples is likely easier). So, of these, I'd ignore MultiSource/Benchmarks/Prolangs-C/fixoutput/fixoutput, SingleSource/UnitTests/2003-05-02-DependentPHI, SingleSource/UnitTests/2003-07-06-IntOverflow, and probably all of the slowdowns (but they're significant enough to justify taking more samples -- sorry, this is not completely scientific).

You can also go into the directory containing the particular tests in question and run the 'make' commands there (that is certainly faster than re-running everything). And, lastly, you might also want to restrict your testing to those tests known to yield reasonable results for benchmarking (do this by defining BENCHMARKING_ONLY=1 on each 'make' invocation).

As a general note, I think the idea proposed is likely to yield a profitable heuristic; it is just a matter of making sure the implementation makes sense.

Hi Hal,

I ran more samples in a quieter manner, but there were no significant speedups/slowdowns. I also tried a more aggressive scheme, which was the first patch proposed here: just interleave large loops with reductions by UF, and forget about the cost function to reduce UF. I still got the same performance on POWER8 (about 20 tests get a larger interleave factor this way).

The good news is that when we interleave such large loops, there is no spilling observed on POWER8. That could be explained by the fact that there are many registers on this target, and that the register pressure heuristics are decent.

There are still specific cases (like the one originally discussed) where a very significant speedup has been observed (up to 3x). We should definitely optimize for these cases. The heuristic with the cost function is more complexity than it's worth. What about allowing large loops with reductions to be interleaved on certain targets? We could have a TTI function for that, TTI.enableAggressiveInterleaving() for instance, which would return false except on POWER7 and POWER8, where the interleave factor can be large and have an impact on performance.

Thanks,
Olivier

I think this is reasonable at the present time. At some point we might have a reasonable way to model instruction throughput vs. latency, the effect of OoO cross-iteration dispatch, etc., but we don't currently. Tackling that is likely a long-term project.

ohsallen updated this revision to Diff 21240.Mar 4 2015, 3:55 PM
ohsallen edited edge metadata.

Here is the patch implementing the proposed solution. Thanks!

hfinkel added inline comments.Mar 5 2015, 6:30 PM
include/llvm/Analysis/TargetTransformInfo.h
334

I don't like this description. I recommend just saying:

/// \brief Don't restrict Interleaved unrolling to small loops.
337

Let's make this:

bool enableAggressiveInterleaving(bool hasReductions) const;

because I'd like to enable this for all loops on the A2 (not just the ones with reductions)
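Schematically, the suggested shape would be something like the following (a sketch only; the class names are illustrative, not the committed patch):

```cpp
// Sketch of the suggested TTI hook: a conservative default plus a
// target override. Class names here are illustrative, not LLVM's.
struct TTISketch {
  /// \brief Don't restrict interleaved unrolling to small loops.
  virtual bool enableAggressiveInterleaving(bool HasReductions) const {
    return false; // most targets keep the small-loop restriction
  }
  virtual ~TTISketch() = default;
};

// A P7/P8-like target opts in for loops containing reductions; an
// A2-like target could simply return true for all loops.
struct PPCLikeTTI : TTISketch {
  bool enableAggressiveInterleaving(bool HasReductions) const override {
    return HasReductions;
  }
};
```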

Committed revision 231528. Thanks for your help!

Olivier

anemet edited edge metadata.Aug 25 2015, 11:07 AM

Olivier, please close this.

anemet accepted this revision.Apr 8 2016, 5:11 PM
anemet edited edge metadata.

This was accepted by Hal and then committed.

This revision is now accepted and ready to land.Apr 8 2016, 5:11 PM
anemet closed this revision.Apr 8 2016, 5:11 PM