This doesn't help the motivating cases in:
...yet, but I'd like to get feedback on the general approach.
The general idea is that if we have a legal vector pointer type, but we are bitcasting that pointer to only load a subset of a vector, then load the whole vector (if that is safe) and extract the subset of the vector.
This will allow SLP and/or instcombine to fold subsequent scalar ops together more easily because they will see extractelement ops from a single vector rather than incomplete parts of that vector.
Currently, this transform will make no overall difference to these most basic patterns because the backend (DAGCombiner) will narrow the loads back down to scalars via narrowExtractedVectorLoad().