Try to vectorize loops whose trip count is smaller than TinyTripCountVectorThreshold under the OptForSize constraint, instead of not attempting to vectorize them at all. The OptForSize constraint permits little if any overhead outside the vectorized loop body, so the current cost estimate comparing the vectorized and scalar loop bodies should be sufficiently accurate for such loops.
This also holds when the small trip count is derived from profile data rather than static analysis, which covers potential cases where the trip count is statically known to be divisible by the VF.
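
The intended decision can be sketched roughly as follows. This is a simplified illustration, not the patch itself: `Decision`, `LoopSketch`, `decide`, and the cost fields are hypothetical names that do not mirror the LoopVectorize API. It models only a statically known trip count, and it assumes that "no overhead outside the loop body" means no scalar tail and no runtime checks.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only; not LLVM's actual API.
enum class Decision { KeepScalar, VectorizeNormally, VectorizeNoOverhead };

struct LoopSketch {
  uint64_t tripCount;      // statically known trip count (assumption)
  double scalarIterCost;   // estimated cost of one scalar iteration
  double vectorIterCost;   // estimated cost of one vector iteration (covers vf scalar ones)
  bool needsRuntimeChecks; // e.g. overlap checks that would run outside the loop
};

Decision decide(const LoopSketch &loop, uint64_t tinyThreshold, unsigned vf) {
  assert(vf > 0 && "vectorization factor must be positive");

  // Body-only comparison: per-scalar-iteration cost of the vector body
  // versus the scalar body.
  bool bodyProfitable = loop.vectorIterCost / vf < loop.scalarIterCost;
  if (!bodyProfitable)
    return Decision::KeepScalar;

  // Loops at or above the threshold follow the usual path.
  if (loop.tripCount >= tinyThreshold)
    return Decision::VectorizeNormally;

  // Tiny trip count: instead of giving up, vectorize only if nothing has to
  // be emitted outside the loop body, so the body-only estimate stays valid.
  bool needsScalarTail = loop.tripCount % vf != 0;
  if (needsScalarTail || loop.needsRuntimeChecks)
    return Decision::KeepScalar;

  return Decision::VectorizeNoOverhead;
}
```

In this sketch, with a threshold of 16 and a VF of 4, a loop of 8 iterations that needs no runtime checks is vectorized, while a loop of 10 iterations stays scalar because it would need a tail.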
Patch inspired by D32451.