This is an archive of the discontinued LLVM Phabricator instance.

LoopVectorizer: let target limit memory intensive loops
Needs Review · Public

Authored by jonpa on Mar 8 2017, 3:55 AM.

Details

Summary

On SystemZ it is imperative during loop unrolling that the number of stores in the resulting loop does not grow past the point where the processor can no longer handle them all and severely slows down as a result. This happens when store tags run out. To avoid it during loop unrolling, the SystemZ backend counts the number of stores and, based on that sum, computes a limit on the number of iterations to produce.
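The store-counting scheme described above can be sketched as follows. The function name and tag-limit parameter are illustrative assumptions, not the actual SystemZ backend code:

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical sketch: given the number of stores in the loop body and the
// number of store tags the processor can track, compute the largest unroll
// factor whose unrolled body still fits within the tag budget.
unsigned maxUnrollForStoreTags(unsigned NumStores, unsigned StoreTagLimit) {
  if (NumStores == 0)
    return UINT_MAX; // no stores: this heuristic imposes no limit
  // Unrolling by U multiplies the store count by U, so keep
  // U * NumStores <= StoreTagLimit.
  return std::max(1u, StoreTagLimit / NumStores);
}
```

The division rounds down, so the returned factor is the largest one whose unrolled body stays within the budget; the `std::max` keeps the original (non-unrolled) loop legal even when it already exceeds the budget.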

This problem should be handled during loop vectorization as well. The loop vectorizer may decide to vectorize a loop while scalarizing a particular (store) instruction, which increases the number of stores. It can also perform unrolling ("interleaving"), which also increases the number of stores.

In order to handle the case of scalarization, the widening decision must be available via a call to getWideningDecision(). Therefore this check must either be implemented in LoopVectorize.cpp, or the LoopVectorizationCostModel class must somehow be factored out of the file so that the target can query the InstWidening result for each store. I have started with the simpler option of implementing it directly in the LoopVectorizer, in the hope that this does not prove too crude to accept.

  • checkVectorizationFactorForMem() must be called after expectedCost(), so that the widening decisions for each VF are available.
  • Since getWideningDecision() is parameterized with VF, checkVectorizationFactorForMem() is called with each VF considered.
  • limitUnrollForMem() computes the max unroll factor in a similar fashion by counting stores. I felt I had to avoid the name limitInterleaveFactorForMem, because 'interleaving' is already used (in my opinion in a confusing way) for both memory-interleaving and unrolling.
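As a rough sketch of how scalarization inflates the store count in the check above (the names and the WideningKind enum are hypothetical stand-ins for the cost model's InstWidening):

```cpp
#include <cassert>
#include <vector>

// A scalarized store is emitted once per vector lane, so it contributes VF
// stores to the vector loop body; a widened store contributes just one.
enum class WideningKind { Widen, Scalarize };

unsigned countStoresForVF(const std::vector<WideningKind> &StoreDecisions,
                          unsigned VF) {
  unsigned NumStores = 0;
  for (WideningKind K : StoreDecisions)
    NumStores += (K == WideningKind::Scalarize) ? VF : 1;
  return NumStores;
}
```

This is why the check has to run per VF: the same widening decisions yield a different store count for each candidate vectorization factor.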

Diff Detail

Event Timeline

jonpa created this revision. Mar 8 2017, 3:55 AM
mssimpso edited edge metadata. Mar 8 2017, 12:22 PM

I think it might be better if the store-counting part of this was located in the SystemZ TTI implementation. So the hook would look something like TTI.hookName(WideningDecisions, VF). What do you think? Letting the targets see all the memory access decisions would probably be more generally useful. And I believe the only thing that would have to be exposed is the InstWidening enum.

Let me make sure that I understand: Are you trying to limit the number of store instructions/loop or are you trying to limit the number of stores/cycle?

jonpa updated this revision to Diff 91154. Mar 9 2017, 3:23 AM

I think it might be better if the store-counting part of this was located in the SystemZ TTI implementation. So the hook would look something like TTI.hookName(WideningDecisions, VF). What do you think? Letting the targets see all the memory access decisions would probably be more generally useful. And I believe the only thing that would have to be exposed is the InstWidening enum.

I was suspecting this might be generally preferred... :-)

  • Made the new hook instead return a pair of values {Max, Count}, with the Loop, WideningDecisions, and VF as arguments. (I put the new hook next to getOperandsScalarizationOverhead() and enableAggressiveInterleaving(), even though this seems to be in the 'Scalar' section of the header file, which makes me think it is actually not the right place?)
  • Factored out InstWidening and made a new class DecisionList with a member getWideningDecision() in order to avoid having to duplicate that code. Put it in TargetInstrInfo.h. Is it OK to declare these in the open, or should they perhaps be wrapped in a namespace or something?
  • Moved the actual counting of stores to the SystemZ implementation.
  • Added a short debug output in case this triggers.

Let me make sure that I understand: Are you trying to limit the number of store instructions/loop or are you trying to limit the number of stores/cycle?

It's about limiting the general input of stores into the processor.

...

Let me make sure that I understand: Are you trying to limit the number of store instructions/loop or are you trying to limit the number of stores/cycle?

It's about limiting the general input of stores into the processor.

Please be very clear about which problem you are solving. Let me explain:

  1. Imagine a processor that has a special cache for loop instructions, and this loop cache can only hold N stores. In this case, it is really important that the loop have no more than N store instructions.
  2. Imagine a processor that cannot sustain one store/cycle because of limited bandwidth or other resources. Over N cycles, only M (< N) stores can be processed. This might or might not be related to the size of the store.

I think you're in situation (2) where the size of the store does not matter (since you described some kind of tag resource). In that case, please don't count stores in the loop. You'll want to get the anticipated loop cost (which is the closest thing we have to a cycle count) for each VF and go from there. You might also limit the VF based on the number of stores that would appear in a row, because that will always exceed your limit and stall something.

If you're in (1), then your limit probably makes sense only for loops that will fit in the relevant cache.

In short, vectorized loops can be quite large and, whatever you're doing, just counting stores sounds like an insufficiently-precise model.
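A minimal sketch of the stores-per-cycle style check suggested here, using the cost model's loop cost as a stand-in for a cycle count (all names and the bandwidth parameters are illustrative assumptions):

```cpp
#include <cassert>

// Reject a vectorization factor whose store density would exceed what the
// processor can sustain: at most MaxStores stores per N "cycles", where the
// anticipated loop cost approximates the cycle count of one iteration.
bool storesFitBandwidth(unsigned NumStores, unsigned LoopCost,
                        unsigned MaxStores, unsigned N) {
  // NumStores / LoopCost <= MaxStores / N, rearranged to avoid division.
  return NumStores * N <= MaxStores * LoopCost;
}
```

Unlike a raw store count, this ratio stays meaningful as the vectorized loop body grows, which is the point being made above.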

uweigand edited edge metadata. Mar 9 2017, 6:56 AM

Please be very clear about which problem you are solving. Let me explain:

  1. Imagine a processor that has a special cache for loop instructions, and this loop cache can only hold N stores. In this case, it is really important that the loop have no more than N store instructions.
  2. Imagine a processor that cannot sustain one store/cycle because of limited bandwidth or other resources. Over N cycles, only M (< N) stores can be processed. This might or might not be related to the size of the store.

I think you're in situation (2) where the size of the store does not matter (since you described some kind of tag resource). In that case, please don't count stores in the loop. You'll want to get the anticipated loop cost (which is the closest thing we have to a cycle count) for each VF and go from there. You might also limit the VF based on the number of stores that would appear in a row, because that will always exceed your limit and stall something.

If you're in (1), then your limit probably makes sense only for loops that will fit in the relevant cache.

In short, vectorized loops can be quite large and, whatever you're doing, just counting stores sounds like an insufficiently-precise model.

The specific issue is neither quite (1) nor (2) as described above, but rather the following: while stores are "in flight" during the out-of-order execution, each store needs to be tagged with a store tag, which is used to make sure the out-of-order execution doesn't violate ordering constraints implied by the original instruction order. There is only a limited number of those store tags available, and if this is exceeded, the processor may have to flush the pipeline.

To avoid this, we want to ensure that at any point, the number of stores "in flight" simultaneously does not exceed this limit. The problem is that it is hard to predict *precisely* when each store is actually in flight; this depends on decisions of the out-of-order engine (e.g. when each instruction is issued) that are in general impossible to predict in the compiler. So we have to use some heuristic that achieves the effect we want in most cases.

When we originally looked into this issue, this was in the context of loop unrolling. We wanted to unroll loops, in particular small loops, in order to increase the overall execution rate, e.g. by reducing the total number of branches (and thereby pressure on the branch prediction subsystem). However, in some cases we noted that the unrolled loop performed worse than the original one. Investigation showed that due to the overall higher execution rate, the number of stores simultaneously in flight also increased, and now exceeded the limit of store tags, causing pipeline flushes.

To fix this, we wanted to stop loop unrolling before that happened, and tried to design a heuristic to detect that case. Using a heuristic modeled after your situation (2) didn't work well, since the ratio of stores to total instructions is mostly unaffected by unrolling; this metric doesn't track the secondary effects that were causing the difference we're seeing (e.g. prediction rate of the loop backwards branch). So we ended up with a simple heuristic modeled after your situation (1), i.e. simply don't unroll loops (any further) if the total number of stores in the loop exceeds a threshold.

Now of course in a very large loop, the presence of more stores than this threshold would be unlikely to cause any problems, and in that sense the heuristic is very conservative. However, for the purpose of loop unrolling this was acceptable, since for large loops it generally doesn't matter anyway if they're unrolled or not -- only for small loops is it really critical.

However, I can see your point that for a *vectorization* decision, this last statement is no longer true; we certainly do want to vectorize large loops! Jonas, I think this means we may indeed have to do something better than just counting the total number of stores in the loop ... (As always with these heuristics, this probably means tuning the parameters via actual measurements.)
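The threshold heuristic described in this reply can be sketched as follows; the function name and threshold value are illustrative, not the actual implementation:

```cpp
#include <cassert>

// Shrink a requested unroll factor until the unrolled body no longer exceeds
// a fixed store-count threshold; never go below 1 (the original loop).
unsigned limitUnrollByStoreThreshold(unsigned NumStoresInBody,
                                     unsigned RequestedUF,
                                     unsigned StoreThreshold) {
  unsigned UF = RequestedUF;
  while (UF > 1 && UF * NumStoresInBody > StoreThreshold)
    --UF;
  return UF;
}
```

Note how the clamp-to-1 floor encodes the conservatism discussed above: a large loop already over the threshold is simply left un-unrolled, which is acceptable for unrolling but, as pointed out, not for vectorization.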

Please be very clear about which problem you are solving. Let me explain:

  1. Imagine a processor that has a special cache for loop instructions, and this loop cache can only hold N stores. In this case, it is really important that the loop have no more than N store instructions.
  2. Imagine a processor that cannot sustain one store/cycle because of limited bandwidth or other resources. Over N cycles, only M (< N) stores can be processed. This might or might not be related to the size of the store.

I think you're in situation (2) where the size of the store does not matter (since you described some kind of tag resource). In that case, please don't count stores in the loop. You'll want to get the anticipated loop cost (which is the closest thing we have to a cycle count) for each VF and go from there. You might also limit the VF based on the number of stores that would appear in a row, because that will always exceed your limit and stall something.

If you're in (1), then your limit probably makes sense only for loops that will fit in the relevant cache.

In short, vectorized loops can be quite large and, whatever you're doing, just counting stores sounds like an insufficiently-precise model.

The specific issue is neither quite (1) nor (2) as described above, but rather the following: while stores are "in flight" during the out-of-order execution, each store needs to be tagged with a store tag, which is used to make sure the out-of-order execution doesn't violate ordering constraints implied by the original instruction order. There is only a limited number of those store tags available, and if this is exceeded, the processor may have to flush the pipeline.

To avoid this, we want to ensure that at any point, the number of stores "in flight" simultaneously does not exceed this limit. The problem is that it is hard to predict *precisely* when each store is actually in flight; this depends on decisions of the out-of-order engine (e.g. when each instruction is issued) that are in general impossible to predict in the compiler. So we have to use some heuristic that achieves the effect we want in most cases.

When we originally looked into this issue, this was in the context of loop unrolling. We wanted to unroll loops, in particular small loops, in order to increase the overall execution rate, e.g. by reducing the total number of branches (and thereby pressure on the branch prediction subsystem). However, in some cases we noted that the unrolled loop performed worse than the original one. Investigation showed that due to the overall higher execution rate, the number of stores simultaneously in flight also increased, and now exceeded the limit of store tags, causing pipeline flushes.

To fix this, we wanted to stop loop unrolling before that happened, and tried to design a heuristic to detect that case. Using a heuristic modeled after your situation (2) didn't work well, since the ratio of stores to total instructions is mostly unaffected by unrolling; this metric doesn't track the secondary effects that were causing the difference we're seeing (e.g. prediction rate of the loop backwards branch). So we ended up with a simple heuristic modeled after your situation (1), i.e. simply don't unroll loops (any further) if the total number of stores in the loop exceeds a threshold.

Now of course in a very large loop, the presence of more stores than this threshold would be unlikely to cause any problems, and in that sense the heuristic is very conservative. However, for the purpose of loop unrolling this was acceptable, since for large loops it generally doesn't matter anyway if they're unrolled or not -- only for small loops is it really critical.

However, I can see your point that for a *vectorization* decision, this last statement is no longer true; we certainly do want to vectorize large loops! Jonas, I think this means we may indeed have to do something better than just counting the total number of stores in the loop ... (As always with these heuristics, this probably means tuning the parameters via actual measurements.)

This makes a lot of sense to me, thanks for explaining!

mkuper added a subscriber: mkuper. Mar 9 2017, 10:27 PM
jonpa updated this revision to Diff 92616. Mar 22 2017, 3:44 AM

I reworked this so that the decisions for VF and IC are handled separately:

New method checkVectorizationFactorForMem()
New argument: unsigned getMaxInterleaveFactor(unsigned VF, DecisionList *WideningDecisions = nullptr) const;
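A sketch of how a target might use the new optional argument. DecisionListStub and all constants here are illustrative stand-ins for the patch's DecisionList and for target tuning values:

```cpp
#include <algorithm>
#include <cassert>

struct DecisionListStub {   // stand-in for the patch's DecisionList
  unsigned NumStores = 0;   // store count after widening/scalarization for VF
};

// The optional widening-decision list lets the target clamp its default
// interleave count based on how many stores the vector body will contain.
unsigned getMaxInterleaveFactor(unsigned VF,
                                const DecisionListStub *Decisions = nullptr) {
  const unsigned DefaultIC = 2;     // illustrative target default
  const unsigned StoreTagLimit = 8; // illustrative tag budget
  if (!Decisions || Decisions->NumStores == 0)
    return DefaultIC;
  unsigned Limit = std::max(1u, StoreTagLimit / Decisions->NumStores);
  return std::min(DefaultIC, Limit);
}
```

With a null list the hook behaves exactly as before, so existing callers that cannot supply widening decisions are unaffected.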

> Jonas, I think this means we may indeed have to do something better than just counting the total number of stores in the loop ...
See SystemZTTIImpl::checkVectorizationFactorForMem()