The old CPU model only modeled MLA->MLA forwarding. I added some
missing MUL->MLA read advances and a missing absolute difference
accumulate read advance, following the Cortex-A57 Software
Optimization Guide.
Also added missing schedules for "ASIMD shift by immed, basic" and
"ASIMD shift by register, basic, D-form".
The patch improves performance on EEMBC benchmarks and causes no
significant regressions (none in SPEC).
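
For readers unfamiliar with read advances, here is a rough sketch of
the effect of adding MUL to the set of writes that an accumulator read
is advanced against. It is a toy model only: the function name, the
class names, and the cycle counts are hypothetical and are not taken
from the guide or from the scheduling model itself.

```python
# Toy model of a scheduler read advance; all names and cycle counts are
# hypothetical and are not taken from the Cortex-A57 guide or from LLVM.

def operand_latency(producer, write_latency, read_advance, valid_writes):
    """Cycles a consumer waits for an operand written by `producer`.

    The read advance applies only when the producer's class is listed in
    `valid_writes`; the result is clamped at zero.
    """
    advance = read_advance if producer in valid_writes else 0
    return max(0, write_latency - advance)

MUL_LATENCY = 4   # hypothetical multiply latency in cycles
ACC_ADVANCE = 3   # hypothetical accumulator read advance in cycles

OLD_VALID = {"MLA"}          # old model: only MLA->MLA forwarding
NEW_VALID = {"MLA", "MUL"}   # with MUL->MLA read advances added

old = operand_latency("MUL", MUL_LATENCY, ACC_ADVANCE, OLD_VALID)
new = operand_latency("MUL", MUL_LATENCY, ACC_ADVANCE, NEW_VALID)
print(old, new)  # prints "4 1"
```

In this sketch a MUL feeding the accumulator operand of an MLA is
charged the full multiply latency under the old set and the reduced
latency once MUL is listed as a valid producer.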