The max VF is based on the smallest and widest types in the loop, which are returned by getSmallestAndWidestTypes(). As an example, on SystemZ a vector register is 128 bits, so if an i64 or double is defined, the feasible max VF is 2.
My observation is that if a load is in effect truncated, it would make sense to also evaluate the VF derived from the truncated type in the cost analysis. For instance, if a load of double is truncated to float, VF 4 could be more interesting than VF 2, so MaxVF should be 4. Note that this only means that VF 4 becomes a candidate and is compared against VF 2 by the cost model; it does not force VF 4.
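The arithmetic above can be sketched in a few lines of standalone C++. This is only an illustration under stated assumptions, not the LLVM implementation: `MemAccess`, `widestTypeBits` and `maxVF` are hypothetical names, and the max VF is assumed to be simply the register width divided by the widest element width.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of one memory access in the loop (not an LLVM type).
struct MemAccess {
  unsigned TypeBits;      // width of the loaded or stored element type
  unsigned TruncatedBits; // width after truncation, or 0 if not truncated
};

// Widest element width in the loop. With UseTruncatedType set, an access
// whose value is truncated counts with the narrower, truncated width.
unsigned widestTypeBits(const std::vector<MemAccess> &Accesses,
                        bool UseTruncatedType) {
  unsigned Widest = 0;
  for (const MemAccess &A : Accesses) {
    unsigned Bits = (UseTruncatedType && A.TruncatedBits != 0)
                        ? A.TruncatedBits
                        : A.TypeBits;
    Widest = std::max(Widest, Bits);
  }
  return Widest;
}

// Assumed rule: the largest VF for which the widest element type still
// fits into a single vector register.
unsigned maxVF(unsigned RegisterBits, unsigned WidestBits) {
  assert(WidestBits != 0 && WidestBits <= RegisterBits);
  return RegisterBits / WidestBits;
}
```

For a loop that loads a double, truncates it to float and stores the float, `{{64, 32}, {32, 0}}` with 128-bit registers gives a max VF of 2 when the load type is used and 4 when the truncated type is used.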
In practice, this seems to affect mostly load -> store type loops on SystemZ (SPEC), although the win would be bigger for a loop with more operations on the narrow type. Therefore I am not so sure of the worth of this, but it seems reasonable to me. It seems that ~80 loops have slightly fewer instructions per iteration of the original loop on z13.
I am also not sure whether it should 'continue' instead of taking the truncated type (given that the truncate should be present in the loop and is therefore also checked).
I saw one loop that got slightly worse. It was a load -> store loop where the i16 loads were truncated to i8. Since two loads were stored with interleaving on the store, the vector store now took 2 vector registers. This gave worse code on SystemZ for some reason, even though it should have just scaled, I think. Generally, either the target should return a higher cost, or the backend should be fixed to reflect the cost, so that the lower VF is selected in this case.
I could make a test if the reviewers think the concept is a good one. There are no test failures caused by this.
Call *I.user_begin(), or rather user_back(), once instead of thrice?
Checking isa<LoadInst> is somewhat redundant.
Taking the smaller T helps reduce MinWidth, but may also reduce MaxWidth.
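To make the last point concrete, here is a minimal standalone sketch. It assumes the smallest and widest widths are tracked as a plain running min/max over element widths in bits; `smallestAndWidest` is a hypothetical helper, not the LLVM function.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: returns {MinWidth, MaxWidth} in bits over the
// element widths seen in a loop. The input must not be empty.
std::pair<unsigned, unsigned>
smallestAndWidest(const std::vector<unsigned> &WidthsInBits) {
  assert(!WidthsInBits.empty());
  unsigned MinWidth = ~0u; // start above any real width
  unsigned MaxWidth = 0;
  for (unsigned W : WidthsInBits) {
    MinWidth = std::min(MinWidth, W);
    MaxWidth = std::max(MaxWidth, W);
  }
  return {MinWidth, MaxWidth};
}
```

Take a loop with an i32 load that is truncated to i8 and a separate i16 access. Counting the load as i32 gives widths `{32, 16}`, so MinWidth is 16 and MaxWidth is 32. Counting it as i8 gives `{8, 16}`, so MinWidth drops to 8 but MaxWidth also drops to 16.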