On SystemZ it is imperative that loop unrolling does not produce a loop with more stores than the processor can handle, because exceeding that point causes a severe slowdown. This happens when the processor runs out of store tags. To avoid this, the SystemZ backend counts the stores in the loop during unrolling and derives from that count a limit on the number of iterations to produce.
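As a rough illustration of the store-counting idea described above, here is a minimal standalone sketch. It does not use LLVM types. The names `Op` and `maxUnrollCountForStores` and the `storeBudget` parameter are hypothetical, and the actual budget is a property of the target that this summary does not state.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical, simplified instruction kinds for a loop body.
enum class Op { Load, Store, Other };

// Returns the largest unroll count for which the unrolled loop contains at
// most `storeBudget` stores. `storeBudget` is an assumed, target-specific
// number; no real value is implied here.
unsigned maxUnrollCountForStores(const std::vector<Op> &body,
                                 unsigned storeBudget) {
  unsigned numStores = static_cast<unsigned>(
      std::count(body.begin(), body.end(), Op::Store));
  // A loop without stores is not limited by this check.
  if (numStores == 0)
    return std::numeric_limits<unsigned>::max();
  // Each unrolled iteration repeats every store once. Never return less
  // than 1, which corresponds to the original, non-unrolled loop.
  return std::max(1u, storeBudget / numStores);
}
```

With an assumed budget of 12 and a loop body containing 3 stores, this sketch allows at most 4 unrolled iterations.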
This problem should be handled during loop vectorization as well. The loop vectorizer may decide to vectorize a loop while scalarizing a particular (store) instruction, which increases the number of stores. It can also perform unrolling ("interleaving"), which increases the number of stores further.
To handle the scalarization case, the widening decision must be available through a call to getWideningDecision(). The check must therefore either be implemented in LoopVectorize.cpp, or the LoopVectorizationCostModel class must somehow be factored out of that file so that the target can get the InstWidening result for each store. I have begun with the simpler option of implementing it directly in LoopVectorize.cpp, in the hope that this does not prove too crude to accept.
- checkVectorizationFactorForMem() must be called after expectedCost(), so that the widening decisions for each VF are available.
- Since getWideningDecision() is parameterized with VF, checkVectorizationFactorForMem() is called with each VF considered.
- limitUnrollForMem() computes the max unroll factor in a similar fashion by counting stores. I felt I had to avoid the name limitInterleaveFactorForMem, because 'interleaving' is already used (in my opinion in a confusing way) for both memory-interleaving and unrolling.
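To make the per-VF counting in the points above concrete, here is a standalone sketch of the idea. It is not the patch's code. `Widening`, `DecisionFn`, `countStoresForVF`, `vfFitsStoreBudget`, `maxInterleaveForStores` and `storeBudget` are hypothetical stand-ins, and the sketch assumes that a widened store becomes exactly one vector store while a scalarized store becomes VF scalar stores.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <limits>

// Hypothetical stand-in for the cost model's per-instruction decision.
enum class Widening { Widen, Scalarize };

// Hypothetical stand-in for a getWideningDecision(I, VF)-style query:
// given a store index and a VF, report how that store will be emitted.
using DecisionFn = std::function<Widening(std::size_t, unsigned)>;

// Stores in one iteration of the vectorized loop for the given VF.
// Assumption: widened store -> 1 store, scalarized store -> VF stores.
unsigned countStoresForVF(std::size_t numStores, unsigned VF,
                          const DecisionFn &decisionFor) {
  unsigned total = 0;
  for (std::size_t i = 0; i < numStores; ++i)
    total += (decisionFor(i, VF) == Widening::Scalarize) ? VF : 1;
  return total;
}

// True if the vectorized loop at this VF stays within the assumed budget.
bool vfFitsStoreBudget(std::size_t numStores, unsigned VF,
                       const DecisionFn &decisionFor, unsigned storeBudget) {
  return countStoresForVF(numStores, VF, decisionFor) <= storeBudget;
}

// Largest unroll (interleave) count that keeps the stores within budget.
unsigned maxInterleaveForStores(std::size_t numStores, unsigned VF,
                                const DecisionFn &decisionFor,
                                unsigned storeBudget) {
  unsigned perIteration = countStoresForVF(numStores, VF, decisionFor);
  if (perIteration == 0)
    return std::numeric_limits<unsigned>::max();
  return std::max(1u, storeBudget / perIteration);
}
```

Because the decision is queried per VF, the count has to be recomputed for every candidate VF, which is why the check runs only after the decisions for that VF have been made.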