We were previously prefering ZEXTLOAD over EXTLOAD if it is legal. This triggers during X86's promotion of i16->i32. Not sure about other targets.
Using ZEXTLOAD can prevent folding it to SEXTLOAD later if we were to promote a sign extended operand like we would need for SRA. However, X86 doesn't currently promote i16 SRA. I was looking into doing that which is how I found this issue.
This is also blocking our ability to fold 4 byte aligned EXTLOADs with "loadi32". This is what caused most of the test changes here.
There are couple tests where the instruction counts went up. I need to look closer at those.
These are likely to be performance regressions on most targets - the special case x86 'MulConstantOptimization' patterns in combineMul should still be used where possible.