Instructions identified as "scalar with predication" are "vectorized" using a replicating region. If such an instruction is also classified as "uniform after vectorization", meaning that only the first of its VF lanes is used, the replicating region becomes erroneous: only the first instance of the region can and should be formed. Fix such cases by no longer considering these instructions "uniform after vectorization".
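To illustrate the conflict, here is a minimal toy model in C++. It is only a sketch: `ToyInstr`, `isUniformBeforeFix`, `isUniformAfterFix`, `numScalarCopies` and `replicatingRegionIsConsistent` are hypothetical names invented for this example and are not part of LoopVectorize. It assumes that a replicating region is instantiated once per lane (VF times), while a uniform instruction has only a single scalar copy for lane 0.

```cpp
// Toy model only: these types and helpers are hypothetical and do not
// correspond to LLVM's actual data structures or APIs.
struct ToyInstr {
  bool ScalarWithPredication; // replicated per lane inside a predicated region
  bool OnlyFirstLaneUsed;     // every user needs lane 0 only
};

// Classification before the fix (as described above): only lane usage counts.
bool isUniformBeforeFix(const ToyInstr &I) { return I.OnlyFirstLaneUsed; }

// Classification after the fix: a predicated-scalar instruction is never
// treated as uniform, even if only its first lane is used.
bool isUniformAfterFix(const ToyInstr &I) {
  return I.OnlyFirstLaneUsed && !I.ScalarWithPredication;
}

// Assumption of this sketch: a uniform instruction gets one scalar copy,
// any other scalarized instruction gets one copy per lane.
unsigned numScalarCopies(bool Uniform, unsigned VF) { return Uniform ? 1 : VF; }

// A replicating region is formed VF times, so it needs VF scalar copies of
// the instruction. Instructions without predication form no region at all.
bool replicatingRegionIsConsistent(const ToyInstr &I, bool Uniform,
                                   unsigned VF) {
  if (!I.ScalarWithPredication)
    return true;
  return numScalarCopies(Uniform, VF) == VF;
}
```

In this model, an instruction with both flags set is inconsistent under the old classification for any VF greater than 1 and consistent under the new one, at the cost of generating VF copies where a single-instance region would suffice.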
A TODO is left because such cases could be optimized by implementing single-instance regions, although they are rare. The specific case of PR40816 should be optimized by not vectorizing such instructions at all: either by recognizing them as DeadInstructions, or by employing indvars to eliminate them before LV, as discussed in https://reviews.llvm.org/D68577#1742745.
The added test case is a simplification of the original one reported in the PR.
LLVM_DEBUG is redundant under #ifndef NDEBUG; will remove.