Model this function more closely after the BasicTTIImpl version, with
separate handling of loads and stores. For loads, the set of actually loaded
vectors is checked.
This makes the function more readable and, in general, slightly more accurate. I think it is also a good starting point for future improvements.
Note: this should wait until the fix for scalarized loads has landed. I saw a loop that now gets interleaved loads instead of scalarized ones, which seemed wrong in that case.
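To illustrate what "the set of actually loaded vectors" could mean for the load case, here is a minimal standalone sketch. It is not the code from the patch: `countLoadedVectors` and `EltsPerVecReg` are hypothetical names, and the element layout (lane `L` of member `Index` sits at position `L * Factor + Index` of the wide vector) is an assumption about how the interleave group is laid out in memory.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper for illustration only; not part of LLVM's API.
// Counts how many vector registers of the wide load contain at least one
// element that a used member of the interleave group actually needs.
//   VF            - vectorization factor (lanes per member)
//   Factor        - interleave factor (members per group)
//   EltsPerVecReg - assumed number of elements that fit in one vector register
//   Indices       - members of the group that are actually used
std::size_t countLoadedVectors(unsigned VF, unsigned Factor,
                               unsigned EltsPerVecReg,
                               const std::vector<unsigned> &Indices) {
  std::set<unsigned> Touched;
  for (unsigned Index : Indices) {
    for (unsigned Lane = 0; Lane < VF; ++Lane) {
      // Position of this element inside the wide (VF * Factor) vector.
      unsigned Elt = Lane * Factor + Index;
      // Register of the wide load that holds this element.
      Touched.insert(Elt / EltsPerVecReg);
    }
  }
  return Touched.size();
}
```

Under these assumptions, a group with VF = 2, Factor = 8 and only member 0 used touches two of the four registers that make up the wide vector, so a cost based on all four would overestimate the load.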
I think "VecTy" here is already the wide vector type, containing VF * Factor elements. Why do you need to multiply its size by VF *again* here?
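The arithmetic behind this question can be sketched as follows, assuming VecTy really is the wide type. The helper names are hypothetical and only illustrate the relation between VF, Factor and the wide element count; they are not taken from the patch.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not part of LLVM's API.

// Total size in bits of a wide vector that covers a whole interleave group:
// VF lanes for each of the Factor members.
std::uint64_t wideVectorBits(unsigned VF, unsigned Factor, unsigned EltBits) {
  return static_cast<std::uint64_t>(VF) * Factor * EltBits;
}

// If the type already has VF * Factor elements, VF is recovered by dividing
// the element count by Factor; no further multiplication by VF is needed.
unsigned vfFromWideVector(unsigned NumWideElts, unsigned Factor) {
  return NumWideElts / Factor;
}
```

For example, with VF = 4, Factor = 2 and 32-bit elements, the wide type has 8 elements and 256 bits. Multiplying that size by VF once more would give 1024 bits, four times what is actually accessed.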