The costs of AH, SH and MH are already covered for the case where the LHS is 32 bits and the RHS is a 16-bit value loaded from memory and sign-extended to i32.
Since these instructions are also used when the LHS is i16, this patch handles that case as well by recognizing that the loads get folded there too.
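
The rule described above can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual SystemZ TTI code: the names Opcode, isFoldableHalfwordLoad and loadCost are made up for illustration, and costing an unfolded load at 1 is an assumption.

```cpp
// Hypothetical, simplified model of the folding rule; not the real TTI code.
enum class Opcode { Add, Sub, Mul, Other };

// Returns true if a load of LoadBits bits feeding the RHS of Op is assumed
// to fold into AH/SH/MH, given an LHS that is LHSBits wide.
bool isFoldableHalfwordLoad(Opcode Op, unsigned LoadBits, unsigned LHSBits) {
  // Only add, subtract and multiply map to AH, SH and MH.
  if (Op != Opcode::Add && Op != Opcode::Sub && Op != Opcode::Mul)
    return false;
  // The memory operand must be a halfword.
  if (LoadBits != 16)
    return false;
  // LHS i32: the case that was already covered (load sign-extended to i32).
  // LHS i16: the case this patch adds.
  return LHSBits == 32 || LHSBits == 16;
}

// A folded load adds no separate cost; otherwise assume a cost of 1.
unsigned loadCost(Opcode Op, unsigned LoadBits, unsigned LHSBits) {
  return isFoldableHalfwordLoad(Op, LoadBits, LHSBits) ? 0 : 1;
}
```

With this model, a 16-bit load feeding an i16 add is costed the same way as one feeding an i32 add, which is the effect the patch aims for.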
This is NFC on SPEC, but it silently changes the scalar loop cost estimate (in LoopVectorizer) 26 times.
I'm not 100% sure about the implications of LHS being just 16 bits, but this seems to at least match what CodeGen is doing.