This is an archive of the discontinued LLVM Phabricator instance.

Break dependencies in large loops containing reductions (LoopVectorize)
ClosedPublic

Authored by ohsallen on Feb 9 2015, 10:44 AM.

Download Raw Diff

Details

Reviewers

anemet
nadav
aschwaighofer
hfinkel

Summary

For the code below, the loop vectorizer is expected to unroll the loop and break loop-carried dependencies by using more registers (see http://reviews.llvm.org/D7128).

for(int i=0; i<n; i++) {
  for(int i_c=0; i_c<3; i_c++) {
    _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
    sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
  }
}

Here, the inner loop is first unrolled by the regular loop unroller, which doesn't break dependencies. The loop vectorizer should then unroll the outer loop and break dependencies. But this doesn't happen, since the heuristics consider that only small loops are worth to unroll. This patch fixes the heuristics.

For the example above on POWER8, there is a 2.5x speedup. To handle all targets appropriately, I propose the following cost function to compute the unroll factor (not in this patch yet):

UF = UF * CriticalPathLength / LoopLength

UF is the unroll factor already computed up to this point, which takes into account register pressure and is bounded with the TTI max interleave factor. CriticalPathLength and LoopLength only take into account interesting operations (FP?). The cost function approximates reduces UF while ensuring that ILP opportunities are met. It works well for the example above on P8. Please tell me what you think.

Thanks,
Olivier

Diff Detail

Event Timeline

ohsallen updated this revision to Diff 19597.Feb 9 2015, 10:44 AM

ohsallen retitled this revision from to Break dependencies in large loops containing reductions (LoopVectorize).

ohsallen updated this object.

ohsallen edited the test plan for this revision. (Show Details)

ohsallen added a reviewer: hfinkel.

ohsallen added a subscriber: Unknown Object (MLST).Feb 9 2015, 11:50 AM

hfinkel added reviewers: nadav, aschwaighofer, anemet.Feb 10 2015, 9:52 AM

I don't think that just ignoring SmallLoopCost for all loops with reductions will fly ;) -- but, I think that adjusting the UF threshold in a more-intelligent way certainly makes sense. On some cores it is important not to make the size of loops to larger because there is a threshold penalty effect. For example, on Intel cores, there is a small buffer associated with the LSD (the loop stream detector), and we want loops to fit into that buffer. I believe this is why SmallLoopCost is currently so small (20). That having been said, there is a trade-off, and we'll need to do some tuning.

To handle all targets appropriately, I propose the following cost function to compute the unroll factor (not in this patch yet):

UF = UF * CriticalPathLength / LoopLength

I agree, this seems like essentially what you'd like to do (assuming that the original UF is set based on the number of functional units available). We don't however, want this to override the register pressure constraint. Once you start spilling in the loop, most of these considerations are more-or-less irrelevant.

Adam, Nadav, Adam, any thoughts on this?

Procedural note: Please upload full-context patches, see: http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

I don't think that just ignoring SmallLoopCost for all loops with reductions will fly ;) -- but, I think that adjusting the UF threshold in a more-intelligent way certainly makes sense.

Right, typically, we don't want to unroll for the code below. With the proposed cost function, assuming that CriticalPathLength is 1, UF is divided by LoopSize which is large.

for (...) {
    // large instruction sequence unrelated to r
    r += ...
}

On some cores it is important not to make the size of loops to larger because there is a threshold penalty effect. For example, on Intel cores, there is a small buffer associated with the LSD (the loop stream detector), and we want loops to fit into that buffer. I believe this is why SmallLoopCost is currently so small (20). That having been said, there is a trade-off, and we'll need to do some tuning.

We could compute a threshold on the UnrolledLoopSize, which would be related to LSD for Intel and to the I-cache for others targets. Then we would reduce UF until UnrolledLoopSize fits that threshold.

UF = UF * CriticalPathLength / LoopLength

I agree, this seems like essentially what you'd like to do (assuming that the original UF is set based on the number of functional units available). We don't however, want this to override the register pressure constraint. Once you start spilling in the loop, most of these considerations are more-or-less irrelevant.

UF in the right member of the assignment is already computed according to register pressure (using calculateRegisterUsage()), so I assume it took into account the fact that the loop is large. The resulting UF would be usually smaller, so I believe this would be conservative enough?

Full-context patch

In D7514#121455, @ohsallen wrote:
I don't think that just ignoring SmallLoopCost for all loops with reductions will fly ;) -- but, I think that adjusting the UF threshold in a more-intelligent way certainly makes sense.

Right, typically, we don't want to unroll for the code below. With the proposed cost function, assuming that CriticalPathLength is 1, UF is divided by LoopSize which is large.
for (...) {
    // large instruction sequence unrelated to r
    r += ...
}
On some cores it is important not to make the size of loops to larger because there is a threshold penalty effect. For example, on Intel cores, there is a small buffer associated with the LSD (the loop stream detector), and we want loops to fit into that buffer. I believe this is why SmallLoopCost is currently so small (20). That having been said, there is a trade-off, and we'll need to do some tuning.

We could compute a threshold on the UnrolledLoopSize, which would be related to LSD for Intel and to the I-cache for others targets. Then we would reduce UF until UnrolledLoopSize fits that threshold.

UF = UF * CriticalPathLength / LoopLength

I agree, this seems like essentially what you'd like to do (assuming that the original UF is set based on the number of functional units available). We don't however, want this to override the register pressure constraint. Once you start spilling in the loop, most of these considerations are more-or-less irrelevant.

UF in the right member of the assignment is already computed according to register pressure (using calculateRegisterUsage()), so I assume it took into account the fact that the loop is large. The resulting UF would be usually smaller, so I believe this would be conservative enough?

Maybe I'm not understanding exactly what you're proposing. Are you going to calculate the critical path length in units of instructions, or using the throughput costs, or using some latency measure?

Maybe I'm not understanding exactly what you're proposing. Are you going to calculate the critical path length in units of instructions, or using the throughput costs, or using some latency measure?

CriticalPathLength and LoopLength are both in units of instructions.

Hi Olivier,

Probably I miss somethings, but what dependencies would unrolling of the outer loop break?

for(int i=0; i<n; i++) {
  for(int i_c=0; i_c<3; i_c++) {
    _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
    sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
  }
}

I see that it can be beneficial, but it's not about dependencies, as far as I can tell (e.g. SLP probably can vectorize the unrolled body).

Hi Michael,

Probably I miss somethings, but what dependencies would unrolling of the outer loop break?

It would break the dependencies between the reduction operations. With the example below, we can dispatch twice more instructions (if target permits), which is profitable to exploit more ILP.

// Original loop.
for (int i = 0; i < n; i++) 
    for (int j = 0; j < 3; j++)
        r += arr[i][j];

// After unrolling innermost loop.
for (int i = 0; i < n; i++)  {
    r += arr[i][0];
    r += arr[i][1];
    r += arr[i][2];
}

// After unrolling outermost loop (in vectorizer, which breaks dependencies).
for (int i = 0; i < n; i += 2)  {
    r += arr[i][0];
    r_0 += arr[i+1][0];
    r += arr[i][1];
    r_0 += arr[i+1][1];
    r += arr[i][2];
    r_0 += arr[i+1][2];
}
r += r_0;

Thanks for the explanation!

In D7514#121534, @ohsallen wrote:
Hi Michael,

Probably I miss somethings, but what dependencies would unrolling of the outer loop break?

It would break the dependencies between the reduction operations. With the example below, we can dispatch twice more instructions (if target permits), which is profitable to exploit more ILP.
// Original loop.
for (int i = 0; i < n; i++) 
    for (int j = 0; j < 3; j++)
        r += arr[i][j];

// After unrolling innermost loop.
for (int i = 0; i < n; i++)  {
    r += arr[i][0];
    r += arr[i][1];
    r += arr[i][2];
}

We should do some experimentation on other architectures. An ooo core might be able to hide latency by dispatching loop iterations before the previous ones retire, however, I'm not sure what limits exist on this in general, and how we should model it. Do the instruction register dependencies from the reduction kill this in general, or only on certain targets? It might be that, even when it is possible, it is limited by register renaming resources, and if there are many reduction steps in the loop body, and each step requires a rename, then our ability to speculate is limited significantly by that. And there are in-order targets that can't speculate into future loop iterations at all. Thoughts?

// After unrolling outermost loop (in vectorizer, which breaks dependencies).
for (int i = 0; i < n; i += 2) {
r += arr[i][0];
r_0 += arr[i+1][0];
r += arr[i][1];
r_0 += arr[i+1][1];
r += arr[i][2];
r_0 += arr[i+1][2];
}
r += r_0;

Let me try to explain the rationale below the proposed cost function: UF = UF * CriticalPathLength / LoopLength

I assume CriticalPathLength is the number of reductions in the loop (or more precisely, the length of the longest reduction chain found in the loop, which would be 3 for examples above as the innermost loop is inlined).

(CriticalPathLength / LoopLength) gives us the distance between the reductions. Typically, the distance is very short for the examples above, so CriticalPathLength / LoopLength = 3 / 3 = 1, and UF remains unchanged. If the distance is long, it means there are potentially many FP instructions between each reductions, therefore an OoO pipeline will be able to exploit ILP, and we don't need to add much interleaving — UF will be diminished. And if there is no OoO, UF will be short anyways.

In D7514#121703, @ohsallen wrote:

Let me try to explain the rationale below the proposed cost function: UF = UF * CriticalPathLength / LoopLength

I assume CriticalPathLength is the number of reductions in the loop (or more precisely, the length of the longest reduction chain found in the loop, which would be 3 for examples above as the innermost loop is inlined).

(CriticalPathLength / LoopLength) gives us the distance between the reductions. Typically, the distance is very short for the examples above, so CriticalPathLength / LoopLength = 3 / 3 = 1, and UF remains unchanged. If the distance is long, it means there are potentially many FP instructions between each reductions, therefore an OoO pipeline will be able to exploit ILP, and we don't need to add much interleaving — UF will be diminished.

Okay, this sounds reasonable, please provide a patch and we'll benchmark it.

And if there is no OoO, UF will be short anyways.

No, if there is no OoO, then the UF should be set to hide pipeline latency, and so its size will depend on the depth of the pipeline (which is not necessarily small).

For OoO cores with deep pipelines, we should also be unrolling larger loops, but that's another matter.

Hal,

Okay, this sounds reasonable, please provide a patch and we'll benchmark it.

I am worried that the impact will be small, because as I said UF is bounded with TTI.getMaxInterleaveFactor(). Only with PPC, AMDGPU, AArch64 (for CortexA57) and X86 (with AVX), that bound is greater than 2 (but no greater than 4, except for AMDGPU). Typically to get the optimal UF for the example above on P8 using the proposed cost function, TTI.getMaxInterleaveFactor() should return 12 instead of 2 (see http://reviews.llvm.org/D7503). And as you said, this is a theoritical max but it's too big and might do harm because of register pressure.

Maybe we would need another TTI function that provides the theoritical max, like TTI::getMaxTheoriticalInterleaveFactor. Or we would have to multiply UF by some factor, which could be provided through a TTI function with default value 1?

In D7514#122114, @ohsallen wrote:

Hal,

Okay, this sounds reasonable, please provide a patch and we'll benchmark it.

I am worried that the impact will be small, because as I said UF is bounded with TTI.getMaxInterleaveFactor(). Only with PPC, AMDGPU, AArch64 (for CortexA57) and X86 (with AVX), that bound is greater than 2 (but no greater than 4, except for AMDGPU). Typically to get the optimal UF for the example above on P8 using the proposed cost function, TTI.getMaxInterleaveFactor() should return 12 instead of 2 (see http://reviews.llvm.org/D7503). And as you said, this is a theoritical max but it's too big and might do harm because of register pressure.

Right, that's good. We need to make sure there aren't significant regressions in performance (or code size that don't correspond to performance increases).

Maybe we would need another TTI function that provides the theoritical max, like TTI::getMaxTheoriticalInterleaveFactor. Or we would have to multiply UF by some factor, which could be provided through a TTI function with default value 1?

I think we might want to separate the current single number into two numbers: one for ILP and once for latency. But I'm not exactly sure what you're suggesting.

I think we might want to separate the current single number into two numbers: one for ILP and once for latency. But I'm not exactly sure what you're suggesting.

It seems we don't really care about the latency alone, we just want to know about ILP. As I see it, the current number takes into account ILP (that is, latency and number of functional units), and somehow register pressure, like we don't want to put a big number like 12 in there.

What I need is a number which only takes into account ILP, and that is because the cost function I propose already takes register pressure into account. Typically I want to have 12 for P7/P8. What I am suggesting is to add a new TTI function to return the max ILP available.

In D7514#122201, @ohsallen wrote:

I think we might want to separate the current single number into two numbers: one for ILP and once for latency. But I'm not exactly sure what you're suggesting.

It seems we don't really care about the latency alone, we just want to know about ILP. As I see it, the current number takes into account ILP (that is, latency and number of functional units), and somehow register pressure, like we don't want to put a big number like 12 in there.

There is a separate register-pressure heuristic, and already uses a different TTI interface to get the number of available registers. Look at the calculateRegisterUsage() function.

What I need is a number which only takes into account ILP, and that is because the cost function I propose already takes register pressure into account. Typically I want to have 12 for P7/P8. What I am suggesting is to add a new TTI function to return the max ILP available.

Maybe, but even if we only care about latency combined with ILP, we sometimes are only about ILP independent of latency. Thus, I think having interfaces that return (average) instruction latency, (average) ILP, and whether or not the processor can speculate future loop iterations without dependencies (true for OoO, false for in-order, generically) etc. is the most straightforward from the backend modeling perspective.

There is a separate register-pressure heuristic, and already uses a different TTI interface to get the number of available registers. Look at the calculateRegisterUsage() function.

Right, but then why not setting 12 as the max interleave factor for POWER7/POWER8? From our previous discussion (http://reviews.llvm.org/D7503), I understood that you didn't want to put 12 because of potential spillings. I feel like register pressure and available ILP should be completely separated concerns.

Maybe, but even if we only care about latency combined with ILP, we sometimes are only about ILP independent of latency. Thus, I think having interfaces that return (average) instruction latency, (average) ILP, and whether or not the processor can speculate future loop iterations without dependencies (true for OoO, false for in-order, generically) etc. is the most straightforward from the backend modeling perspective.

This sound great, I would be happy to implement that. I feel like when you talk about ILP, you're talking about number of functional units, am I right? I will try to put something together without changing too much the current policy.

Thanks

In D7514#122228, @ohsallen wrote:

There is a separate register-pressure heuristic, and already uses a different TTI interface to get the number of available registers. Look at the calculateRegisterUsage() function.

Right, but then why not setting 12 as the max interleave factor for POWER7/POWER8? From our previous discussion (http://reviews.llvm.org/D7503), I understood that you didn't want to put 12 because of potential spillings. I feel like register pressure and available ILP should be completely separated concerns.

No, I simply didn't want to put 12 without benchmarking it first, because if you put 12, you're really depending on that heuristic to do a good job (because the number of allocatable registers is only a small multiple of that).

Maybe, but even if we only care about latency combined with ILP, we sometimes are only about ILP independent of latency. Thus, I think having interfaces that return (average) instruction latency, (average) ILP, and whether or not the processor can speculate future loop iterations without dependencies (true for OoO, false for in-order, generically) etc. is the most straightforward from the backend modeling perspective.

This sound great, I would be happy to implement that. I feel like when you talk about ILP, you're talking about number of functional units, am I right?

Yes.

I will try to put something together without changing too much the current policy.

Thanks

Sounds good, thanks!

This patch unrolls large loops containing reductions. The cost function discussed here was implemented. Added enableAggressiveFMAFusion to TTI to fine-tune the heuristics.

Herald added a subscriber: jholewinski. · View Herald TranscriptFeb 19 2015, 10:11 AM

Olivier, I don’t really understand this heuristics. It looks like you are adding lots of code to handle a very specific case and I am not convinced that this heuristics is useful for the general case.

Following Hal’s suggestion to benchmark this patch is a really good idea. Do you know how it affects the LLVM test suite? Are you seeing any gains or regressions?

Alright, there are several separable changes here:

The TTI change. I don't think this is the right way to solve the problem (and even if it were, I'd not change the backend to see IR types, that's not necessary, just convert the IR types into backend types (look for TLI->getTypeLegalizationCost(Ty) in the default implementation of getArithmeticInstrCost, for example)).

It seems like, in general, you want a way to measure the latency of some chain of instructions (other than just counting them). This is general problem, and I recommend going after that issue as follow-up work.

As noted below, I don't think you're counting the right thing (or at least, you don't seem to be counting what I'd expect). Can you please elaborate?

lib/Transforms/Vectorize/LoopVectorize.cpp
4592	I don't understand what this is doing. Don't you want to count the number of instructions needed to compute the value being 'added' to the reduction?
4626	This seems like an odd one-off to have here. FMAs are important, granted, but you don't even check if there are FMAs (or things likely to form FMAs) in the loop.

In D7514#127076, @hfinkel wrote:

It seems like, in general, you want a way to measure the latency of some chain of instructions (other than just counting them). This is general problem, and I recommend going after that issue as follow-up work.

Fair enough.

As noted below, I don't think you're counting the right thing (or at least, you don't seem to be counting what I'd expect). Can you please elaborate?

Don't you want to count the number of instructions needed to compute the value being 'added' to the reduction?

Right, or more precisely the number of instructions between each reduction, because even if there are not computing the value being 'added', we need to take them into account when measuring the level of ILP exposed. To compute this number, we divide the number of relevant instruction (integer of floating-point) with the length of the critical path (that is, the longer reduction chain, which is computed via U = *I++).

This seems like an odd one-off to have here. FMAs are important, granted, but you don't even check if there are FMAs (or things likely to form FMAs) in the loop.

I agree, I should be less sloppy and check that properly :-) Actually I was wondering, in general, is it okay to emulate what the DAG combiner will do later on? I wasn't inclined to do that because it seems like replication of code.. Your suggestion above about measuring the latency of some chain of instructions is probably the right approach.

In D7514#126666, @nadav wrote:

Following Hal’s suggestion to benchmark this patch is a really good idea. Do you know how it affects the LLVM test suite? Are you seeing any gains or regressions?

I'll benchmark some variants of this heuristic and come back to you.

Thanks.

I benchmarked this patch (without the multiply-add nonsense) on POWER8 and got the following speedups :

MultiSource/Benchmarks/McCat/08-main/main

-59.4169% +/- 40.2649%

MultiSource/Benchmarks/Prolangs-C/fixoutput/fixoutput

-83.6948% +/- 82.1021%

SingleSource/UnitTests/2003-05-02-DependentPHI

-33.8917% +/- 31.3964%

SingleSource/UnitTests/2003-07-06-IntOverflow

-43.784% +/- 38.5492%

And the following slowdowns:

MultiSource/Applications/kimwitu++/kc

80.6835% +/- 69.7813%

MultiSource/Applications/viterbi/viterbi

35.2072% +/- 24.8361%

MultiSource/Benchmarks/7zip/7zip-benchmark

9.8082% +/- 6.31851%

MultiSource/Benchmarks/nbench/nbench,pass,

8.43677% +/- 7.81566%

Will have to investigate whether the slowdowns are related to spills... However it seems that, if we were able to fine tune this, it would be profitable.

In D7514#128320, @ohsallen wrote:
I benchmarked this patch (without the multiply-add nonsense) on POWER8 and got the following speedups :

MultiSource/Benchmarks/McCat/08-main/main
-59.4169% +/- 40.2649%
MultiSource/Benchmarks/Prolangs-C/fixoutput/fixoutput
-83.6948% +/- 82.1021%
SingleSource/UnitTests/2003-05-02-DependentPHI
-33.8917% +/- 31.3964%
SingleSource/UnitTests/2003-07-06-IntOverflow
-43.784% +/- 38.5492%
And the following slowdowns:

MultiSource/Applications/kimwitu++/kc
80.6835% +/- 69.7813%
MultiSource/Applications/viterbi/viterbi
35.2072% +/- 24.8361%
MultiSource/Benchmarks/7zip/7zip-benchmark
9.8082% +/- 6.31851%
MultiSource/Benchmarks/nbench/nbench,pass,
8.43677% +/- 7.81566%
Will have to investigate whether the slowdowns are related to spills... However it seems that, if we were able to fine tune this, it would be profitable.

Neat; I see you're using my performance benchmarking script. :) -- I forgot to warn you, however (my fault), that I don't trust any of the numbers which have standard deviations more than about half to two-thirds of the delta. It looks like you need more samples (or quieter runs -- likely achievable by running fewer in parallel -- but more samples is likely easier). So, of these, I'd ignore MultiSource/Benchmarks/Prolangs-C/fixoutput/fixoutput, SingleSource/UnitTests/2003-05-02-DependentPHI, SingleSource/UnitTests/2003-07-06-IntOverflow, and probably all of the slowdowns (but they're significant enough to justify taking more samples -- sorry, this is not completely scientific).

You can also go into the directory containing the particular tests in question and run the 'make' commands there (that is certainly faster than re-running everything). And, lastly, you might also want to restrict your testing to those tests known to yield reasonable results for benchmarking (do this by defining BENCHMARKING_ONLY=1 on each 'make' invocation).

As a general note, I think the idea proposed is likely to yield a profitable heuristic; it is just a matter of making sure the implementation makes sense.

Hi Hal,

I ran more samples in a quieter manner, but there was no significant speedups/slowdowns. I also tried a more aggressive scheme, which was the first patch proposed here: just interleave large loops with reductions by UF, and forget about the cost function to reduce UF. Still got the same performance on POWER8 (there are about 20 tests that have a larger interleave factor doing so).

The good news is that when we interleave such large loops, there is no spilling observed on POWER8. That could be explained by the fact that there are many registers on this target, and that the register pressure heuristics are decent.

There are still specific cases (like the one originally discussed) where a very significant speedup have been observed (up to 3x). We should definitely optimize for these cases. The heuristic with the cost function is too much for what it's worth. What about allowing to interleave large loops with reductions for certain targets? We could have a TTI function for that, TTI.enableAggressiveInterleaving() for instance, that would return false except for POWER7 and POWER8, where the interleave factor can be large and have an impact on performance.

Thanks,
Olivier

In D7514#133020, @ohsallen wrote:

Hi Hal,

I ran more samples in a quieter manner, but there was no significant speedups/slowdowns. I also tried a more aggressive scheme, which was the first patch proposed here: just interleave large loops with reductions by UF, and forget about the cost function to reduce UF. Still got the same performance on POWER8 (there are about 20 tests that have a larger interleave factor doing so).

The good news is that when we interleave such large loops, there is no spilling observed on POWER8. That could be explained by the fact that there are many registers on this target, and that the register pressure heuristics are decent.

There are still specific cases (like the one originally discussed) where a very significant speedup have been observed (up to 3x). We should definitely optimize for these cases. The heuristic with the cost function is too much for what it's worth. What about allowing to interleave large loops with reductions for certain targets? We could have a TTI function for that, TTI.enableAggressiveInterleaving() for instance, that would return false except for POWER7 and POWER8, where the interleave factor can be large and have an impact on performance.

I think this is reasonable at the present time. At some point, we might have a reasonable way to model instruction throughput vs. latency, the effect of ooo cross-iteration dispatch, etc., but we don't currently. Tacking that is likely a long-term project.

Thanks,
Olivier

Here is the patch implementing the proposed solution. Thanks!

hfinkel added inline comments.Mar 5 2015, 6:30 PM

include/llvm/Analysis/TargetTransformInfo.h
334	I don't like this description. I recommend just saying: /// \brief Don't restrict Interleaved unrolling to small loops.
337	Let's make this: bool enableAggressiveInterleaving(bool hasReductions) const; because I'd like to enable this for all loops on the A2 (not just the ones with reductions)

Committed revision 231528. Thanks for your help!

Olivier

mzolotukhin removed a subscriber: mzolotukhin.May 12 2015, 6:57 PM

Olivier , Please close this.

This was accepted by Hal and then committed.

This revision is now accepted and ready to land.Apr 8 2016, 5:11 PM

anemet closed this revision.Apr 8 2016, 5:11 PM

Revision Contents

Path

Size

include/

llvm/

Analysis/

TargetTransformInfo.h

9 lines

TargetTransformInfoImpl.h

2 lines

lib/

Analysis/

TargetTransformInfo.cpp

4 lines

Target/

PowerPC/

PPCTargetTransformInfo.cpp

4 lines

Transforms/

Vectorize/

LoopVectorize.cpp

8 lines

test/

Transforms/

LoopVectorize/

PowerPC/

large-loop-rdx.ll

73 lines

Diff 21240

include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 325 Lines • ▼ Show 20 Lines	public:

/// \brief Returns the target's jmp_buf size in bytes.		/// \brief Returns the target's jmp_buf size in bytes.
unsigned getJumpBufSize() const;		unsigned getJumpBufSize() const;

/// \brief Return true if switches should be turned into lookup tables for the		/// \brief Return true if switches should be turned into lookup tables for the
/// target.		/// target.
bool shouldBuildLookupTables() const;		bool shouldBuildLookupTables() const;

		/// \brief Return true if interleaving is usually beneficial for the target.
		hfinkelUnsubmitted Not Done Reply Inline Actions I don't like this description. I recommend just saying: /// \brief Don't restrict Interleaved unrolling to small loops. hfinkel: I don't like this description. I recommend just saying: /// \brief Don't restrict…
		/// This is typically the case when there is a substantial amount of
		/// registers, and hiding latency significantly improves performance.
		bool enableAggressiveInterleaving() const;
		hfinkelUnsubmitted Not Done Reply Inline Actions Let's make this: bool enableAggressiveInterleaving(bool hasReductions) const; because I'd like to enable this for all loops on the A2 (not just the ones with reductions) hfinkel: Let's make this: bool enableAggressiveInterleaving(bool hasReductions) const; because I'd…

/// \brief Return hardware support for population count.		/// \brief Return hardware support for population count.
PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) const;		PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) const;

/// \brief Return true if the hardware has a fast square-root instruction.		/// \brief Return true if the hardware has a fast square-root instruction.
bool haveFastSqrt(Type *Ty) const;		bool haveFastSqrt(Type *Ty) const;

/// \brief Return the expected cost of supporting the floating point operation		/// \brief Return the expected cost of supporting the floating point operation
/// of the specified type.		/// of the specified type.
▲ Show 20 Lines • Show All 184 Lines • ▼ Show 20 Lines	virtual int getScalingFactorCost(Type Ty, GlobalValue BaseGV,
int64_t BaseOffset, bool HasBaseReg,		int64_t BaseOffset, bool HasBaseReg,
int64_t Scale) = 0;		int64_t Scale) = 0;
virtual bool isTruncateFree(Type Ty1, Type Ty2) = 0;		virtual bool isTruncateFree(Type Ty1, Type Ty2) = 0;
virtual bool isProfitableToHoist(Instruction *I) = 0;		virtual bool isProfitableToHoist(Instruction *I) = 0;
virtual bool isTypeLegal(Type *Ty) = 0;		virtual bool isTypeLegal(Type *Ty) = 0;
virtual unsigned getJumpBufAlignment() = 0;		virtual unsigned getJumpBufAlignment() = 0;
virtual unsigned getJumpBufSize() = 0;		virtual unsigned getJumpBufSize() = 0;
virtual bool shouldBuildLookupTables() = 0;		virtual bool shouldBuildLookupTables() = 0;
		virtual bool enableAggressiveInterleaving() = 0;
virtual PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) = 0;		virtual PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) = 0;
virtual bool haveFastSqrt(Type *Ty) = 0;		virtual bool haveFastSqrt(Type *Ty) = 0;
virtual unsigned getFPOpCost(Type *Ty) = 0;		virtual unsigned getFPOpCost(Type *Ty) = 0;
virtual unsigned getIntImmCost(const APInt &Imm, Type *Ty) = 0;		virtual unsigned getIntImmCost(const APInt &Imm, Type *Ty) = 0;
virtual unsigned getIntImmCost(unsigned Opc, unsigned Idx, const APInt &Imm,		virtual unsigned getIntImmCost(unsigned Opc, unsigned Idx, const APInt &Imm,
Type *Ty) = 0;		Type *Ty) = 0;
virtual unsigned getIntImmCost(Intrinsic::ID IID, unsigned Idx,		virtual unsigned getIntImmCost(Intrinsic::ID IID, unsigned Idx,
const APInt &Imm, Type *Ty) = 0;		const APInt &Imm, Type *Ty) = 0;
▲ Show 20 Lines • Show All 101 Lines • ▼ Show 20 Lines	bool isProfitableToHoist(Instruction *I) override {
return Impl.isProfitableToHoist(I);		return Impl.isProfitableToHoist(I);
}		}
bool isTypeLegal(Type *Ty) override { return Impl.isTypeLegal(Ty); }		bool isTypeLegal(Type *Ty) override { return Impl.isTypeLegal(Ty); }
unsigned getJumpBufAlignment() override { return Impl.getJumpBufAlignment(); }		unsigned getJumpBufAlignment() override { return Impl.getJumpBufAlignment(); }
unsigned getJumpBufSize() override { return Impl.getJumpBufSize(); }		unsigned getJumpBufSize() override { return Impl.getJumpBufSize(); }
bool shouldBuildLookupTables() override {		bool shouldBuildLookupTables() override {
return Impl.shouldBuildLookupTables();		return Impl.shouldBuildLookupTables();
}		}
		bool enableAggressiveInterleaving() override {
		return Impl.enableAggressiveInterleaving();
		}
PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) override {		PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) override {
return Impl.getPopcntSupport(IntTyWidthInBit);		return Impl.getPopcntSupport(IntTyWidthInBit);
}		}
bool haveFastSqrt(Type *Ty) override { return Impl.haveFastSqrt(Ty); }		bool haveFastSqrt(Type *Ty) override { return Impl.haveFastSqrt(Ty); }

unsigned getFPOpCost(Type *Ty) override {		unsigned getFPOpCost(Type *Ty) override {
return Impl.getFPOpCost(Ty);		return Impl.getFPOpCost(Ty);
}		}
▲ Show 20 Lines • Show All 187 Lines • Show Last 20 Lines

include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 229 Lines • ▼ Show 20 Lines	public:
bool isTypeLegal(Type *Ty) { return false; }		bool isTypeLegal(Type *Ty) { return false; }

unsigned getJumpBufAlignment() { return 0; }		unsigned getJumpBufAlignment() { return 0; }

unsigned getJumpBufSize() { return 0; }		unsigned getJumpBufSize() { return 0; }

bool shouldBuildLookupTables() { return true; }		bool shouldBuildLookupTables() { return true; }

		bool enableAggressiveInterleaving() { return false; }

TTI::PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) {		TTI::PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) {
return TTI::PSK_Software;		return TTI::PSK_Software;
}		}

bool haveFastSqrt(Type *Ty) { return false; }		bool haveFastSqrt(Type *Ty) { return false; }

unsigned getFPOpCost(Type *Ty) { return TargetTransformInfo::TCC_Basic; }		unsigned getFPOpCost(Type *Ty) { return TargetTransformInfo::TCC_Basic; }

▲ Show 20 Lines • Show All 188 Lines • Show Last 20 Lines

lib/Analysis/TargetTransformInfo.cpp

	Show First 20 Lines • Show All 137 Lines • ▼ Show 20 Lines
	unsigned TargetTransformInfo::getJumpBufSize() const {			unsigned TargetTransformInfo::getJumpBufSize() const {
	return TTIImpl->getJumpBufSize();			return TTIImpl->getJumpBufSize();
	}			}

	bool TargetTransformInfo::shouldBuildLookupTables() const {			bool TargetTransformInfo::shouldBuildLookupTables() const {
	return TTIImpl->shouldBuildLookupTables();			return TTIImpl->shouldBuildLookupTables();
	}			}

				bool TargetTransformInfo::enableAggressiveInterleaving() const {
				return TTIImpl->enableAggressiveInterleaving();
				}

	TargetTransformInfo::PopcntSupportKind			TargetTransformInfo::PopcntSupportKind
	TargetTransformInfo::getPopcntSupport(unsigned IntTyWidthInBit) const {			TargetTransformInfo::getPopcntSupport(unsigned IntTyWidthInBit) const {
	return TTIImpl->getPopcntSupport(IntTyWidthInBit);			return TTIImpl->getPopcntSupport(IntTyWidthInBit);
	}			}

	bool TargetTransformInfo::haveFastSqrt(Type *Ty) const {			bool TargetTransformInfo::haveFastSqrt(Type *Ty) const {
	return TTIImpl->haveFastSqrt(Ty);			return TTIImpl->haveFastSqrt(Ty);
	}			}
	▲ Show 20 Lines • Show All 158 Lines • Show Last 20 Lines

lib/Target/PowerPC/PPCTargetTransformInfo.cpp

Show First 20 Lines • Show All 186 Lines • ▼ Show 20 Lines	if (ST->getDarwinDirective() == PPC::DIR_A2) {
// The A2 is in-order with a deep pipeline, and concatenation unrolling		// The A2 is in-order with a deep pipeline, and concatenation unrolling
// helps expose latency-hiding opportunities to the instruction scheduler.		// helps expose latency-hiding opportunities to the instruction scheduler.
UP.Partial = UP.Runtime = true;		UP.Partial = UP.Runtime = true;
}		}

BaseT::getUnrollingPreferences(L, UP);		BaseT::getUnrollingPreferences(L, UP);
}		}

		bool PPCTTIImpl::enableAggressiveInterleaving() {
		return true;
		}

unsigned PPCTTIImpl::getNumberOfRegisters(bool Vector) {		unsigned PPCTTIImpl::getNumberOfRegisters(bool Vector) {
if (Vector && !ST->hasAltivec() && !ST->hasQPX())		if (Vector && !ST->hasAltivec() && !ST->hasQPX())
return 0;		return 0;
return ST->hasVSX() ? 64 : 32;		return ST->hasVSX() ? 64 : 32;
}		}

unsigned PPCTTIImpl::getRegisterBitWidth(bool Vector) {		unsigned PPCTTIImpl::getRegisterBitWidth(bool Vector) {
if (Vector) {		if (Vector) {
▲ Show 20 Lines • Show All 144 Lines • Show Last 20 Lines

lib/Transforms/Vectorize/LoopVectorize.cpp

Show First 20 Lines • Show All 4,559 Lines • ▼ Show 20 Lines	if (EnableLoadStoreRuntimeUnroll && std::max(StoresUF, LoadsUF) > SmallUF) {
DEBUG(dbgs() << "LV: Unrolling to saturate store or load ports.\n");		DEBUG(dbgs() << "LV: Unrolling to saturate store or load ports.\n");
return std::max(StoresUF, LoadsUF);		return std::max(StoresUF, LoadsUF);
}		}

DEBUG(dbgs() << "LV: Unrolling to reduce branch cost.\n");		DEBUG(dbgs() << "LV: Unrolling to reduce branch cost.\n");
return SmallUF;		return SmallUF;
}		}

		// Unroll if this is a large loop (small loops are already dealt with by this
		// point) and there is a scalar reduction that could benefit from unrolling.
		if (Legal->getReductionVars()->size() &&
		TTI.enableAggressiveInterleaving()) {
		DEBUG(dbgs() << "LV: Unrolling because of reductions.\n");
		return UF;
		}

DEBUG(dbgs() << "LV: Not Unrolling.\n");		DEBUG(dbgs() << "LV: Not Unrolling.\n");
return 1;		return 1;
}		}

LoopVectorizationCostModel::RegisterUsage		LoopVectorizationCostModel::RegisterUsage
LoopVectorizationCostModel::calculateRegisterUsage() {		LoopVectorizationCostModel::calculateRegisterUsage() {
// This function calculates the register usage by measuring the highest number		// This function calculates the register usage by measuring the highest number
// of values that are alive at a single location. Obviously, this is a very		// of values that are alive at a single location. Obviously, this is a very
// rough estimation. We scan the loop in a topological order in order and		// rough estimation. We scan the loop in a topological order in order and
// assign a number to each instruction. We use RPO to ensure that defs are		// assign a number to each instruction. We use RPO to ensure that defs are
// met before their users. We assume that each instruction that has in-loop		// met before their users. We assume that each instruction that has in-loop
// users starts an interval. We record every time that an in-loop value is		// users starts an interval. We record every time that an in-loop value is
// used, so we have a list of the first and last occurrences of each		// used, so we have a list of the first and last occurrences of each
// instruction. Next, we transpose this data structure into a multi map that		// instruction. Next, we transpose this data structure into a multi map that
// holds the list of intervals that end at a specific location. This multi		// holds the list of intervals that end at a specific location. This multi
// map allows us to perform a linear search. We scan the instructions linearly		// map allows us to perform a linear search. We scan the instructions linearly
// and record each time that a new interval starts, by placing it in a set.		// and record each time that a new interval starts, by placing it in a set.
		hfinkelUnsubmitted Not Done Reply Inline Actions I don't understand what this is doing. Don't you want to count the number of instructions needed to compute the value being 'added' to the reduction? hfinkel: I don't understand what this is doing. Don't you want to count the number of instructions…
// If we find this value in the multi-map then we remove it from the set.		// If we find this value in the multi-map then we remove it from the set.
// The max register usage is the maximum size of the set.		// The max register usage is the maximum size of the set.
// We also search for instructions that are defined outside the loop, but are		// We also search for instructions that are defined outside the loop, but are
// used inside the loop. We need this number separately from the max-interval		// used inside the loop. We need this number separately from the max-interval
// usage number because when we unroll, loop-invariant values do not take		// usage number because when we unroll, loop-invariant values do not take
// more register.		// more register.
LoopBlocksDFS DFS(TheLoop);		LoopBlocksDFS DFS(TheLoop);
DFS.perform(LI);		DFS.perform(LI);
Show All 17 Lines	LoopVectorizationCostModel::calculateRegisterUsage() {

unsigned Index = 0;		unsigned Index = 0;
for (LoopBlocksDFS::RPOIterator bb = DFS.beginRPO(),		for (LoopBlocksDFS::RPOIterator bb = DFS.beginRPO(),
be = DFS.endRPO(); bb != be; ++bb) {		be = DFS.endRPO(); bb != be; ++bb) {
R.NumInstructions += (*bb)->size();		R.NumInstructions += (*bb)->size();
for (BasicBlock::iterator it = (bb)->begin(), e = (bb)->end(); it != e;		for (BasicBlock::iterator it = (bb)->begin(), e = (bb)->end(); it != e;
++it) {		++it) {
Instruction *I = it;		Instruction *I = it;
IdxToInstr[Index++] = I;		IdxToInstr[Index++] = I;
		hfinkelUnsubmitted Not Done Reply Inline Actions This seems like an odd one-off to have here. FMAs are important, granted, but you don't even check if there are FMAs (or things likely to form FMAs) in the loop. hfinkel: This seems like an odd one-off to have here. FMAs are important, granted, but you don't even…

// Save the end location of each USE.		// Save the end location of each USE.
for (unsigned i = 0; i < I->getNumOperands(); ++i) {		for (unsigned i = 0; i < I->getNumOperands(); ++i) {
Value *U = I->getOperand(i);		Value *U = I->getOperand(i);
Instruction *Instr = dyn_cast<Instruction>(U);		Instruction *Instr = dyn_cast<Instruction>(U);

// Ignore non-instruction values such as arguments, constants, etc.		// Ignore non-instruction values such as arguments, constants, etc.
if (!Instr) continue;		if (!Instr) continue;
▲ Show 20 Lines • Show All 541 Lines • Show Last 20 Lines

test/Transforms/LoopVectorize/PowerPC/large-loop-rdx.ll

				; RUN: opt < %s -loop-vectorize -S \| FileCheck %s

				; CHECK: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT-NOT: fadd

				target datalayout = "e-m:e-i64:64-n32:64"
				target triple = "powerpc64le-ibm-linux-gnu"

				define void @QLA_F3_r_veq_norm2_V(float* noalias nocapture %r, [3 x { float, float }]* noalias nocapture readonly %a, i32 signext %n) #0 {
				entry:
				%cmp24 = icmp sgt i32 %n, 0
				br i1 %cmp24, label %for.cond1.preheader.preheader, label %for.end13

				for.cond1.preheader.preheader: ; preds = %entry
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.cond1.preheader.preheader, %for.cond1.preheader
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.cond1.preheader ], [ 0, %for.cond1.preheader.preheader ]
				%sum.026 = phi double [ %add10.2, %for.cond1.preheader ], [ 0.000000e+00, %for.cond1.preheader.preheader ]
				%arrayidx5.realp = getelementptr inbounds [3 x { float, float }], [3 x { float, float }]* %a, i64 %indvars.iv, i64 0, i32 0
				%arrayidx5.real = load float, float* %arrayidx5.realp, align 8
				%arrayidx5.imagp = getelementptr inbounds [3 x { float, float }], [3 x { float, float }]* %a, i64 %indvars.iv, i64 0, i32 1
				%arrayidx5.imag = load float, float* %arrayidx5.imagp, align 8
				%mul = fmul fast float %arrayidx5.real, %arrayidx5.real
				%mul9 = fmul fast float %arrayidx5.imag, %arrayidx5.imag
				%add = fadd fast float %mul9, %mul
				%conv = fpext float %add to double
				%add10 = fadd fast double %conv, %sum.026
				%arrayidx5.realp.1 = getelementptr inbounds [3 x { float, float }], [3 x { float, float }]* %a, i64 %indvars.iv, i64 1, i32 0
				%arrayidx5.real.1 = load float, float* %arrayidx5.realp.1, align 8
				%arrayidx5.imagp.1 = getelementptr inbounds [3 x { float, float }], [3 x { float, float }]* %a, i64 %indvars.iv, i64 1, i32 1
				%arrayidx5.imag.1 = load float, float* %arrayidx5.imagp.1, align 8
				%mul.1 = fmul fast float %arrayidx5.real.1, %arrayidx5.real.1
				%mul9.1 = fmul fast float %arrayidx5.imag.1, %arrayidx5.imag.1
				%add.1 = fadd fast float %mul9.1, %mul.1
				%conv.1 = fpext float %add.1 to double
				%add10.1 = fadd fast double %conv.1, %add10
				%arrayidx5.realp.2 = getelementptr inbounds [3 x { float, float }], [3 x { float, float }]* %a, i64 %indvars.iv, i64 2, i32 0
				%arrayidx5.real.2 = load float, float* %arrayidx5.realp.2, align 8
				%arrayidx5.imagp.2 = getelementptr inbounds [3 x { float, float }], [3 x { float, float }]* %a, i64 %indvars.iv, i64 2, i32 1
				%arrayidx5.imag.2 = load float, float* %arrayidx5.imagp.2, align 8
				%mul.2 = fmul fast float %arrayidx5.real.2, %arrayidx5.real.2
				%mul9.2 = fmul fast float %arrayidx5.imag.2, %arrayidx5.imag.2
				%add.2 = fadd fast float %mul9.2, %mul.2
				%conv.2 = fpext float %add.2 to double
				%add10.2 = fadd fast double %conv.2, %add10.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %n
				br i1 %exitcond, label %for.cond.for.end13_crit_edge, label %for.cond1.preheader

				for.cond.for.end13_crit_edge: ; preds = %for.cond1.preheader
				%add10.2.lcssa = phi double [ %add10.2, %for.cond1.preheader ]
				%phitmp = fptrunc double %add10.2.lcssa to float
				br label %for.end13

				for.end13: ; preds = %for.cond.for.end13_crit_edge, %entry
				%sum.0.lcssa = phi float [ %phitmp, %for.cond.for.end13_crit_edge ], [ 0.000000e+00, %entry ]
				store float %sum.0.lcssa, float* %r, align 4
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

Break dependencies in large loops containing reductions (LoopVectorize)ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 21240

include/llvm/Analysis/TargetTransformInfo.h

include/llvm/Analysis/TargetTransformInfoImpl.h

lib/Analysis/TargetTransformInfo.cpp

lib/Target/PowerPC/PPCTargetTransformInfo.cpp

lib/Transforms/Vectorize/LoopVectorize.cpp

test/Transforms/LoopVectorize/PowerPC/large-loop-rdx.ll

Break dependencies in large loops containing reductions (LoopVectorize)
ClosedPublic