D61068 handled vector shifts; this patch does the same for scalars on targets that have a similar number of pipes for shifts as for bit ops.
This is true almost entirely for AMD targets, where the scalar ALUs are well balanced.
This combine avoids an AND with an immediate mask, which usually reduces encoding size.
Some tests show a (slow, scaled) LEA being used instead of SHL, but that's due to the particular shift immediates; shift+mask generates these just as easily.
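
For reference, a minimal sketch of the two equivalent scalar forms being traded off, written as plain C++ rather than DAG nodes. The helper names (shiftPair, shiftMask, selfCheck) and the 32-bit width are assumptions made for illustration and are not taken from the patch; both shift amounts must be less than 32.

```cpp
#include <cassert>
#include <cstdint>

// Shift-pair form: logical right shift by C1, then left shift by C2.
// No mask immediate is needed.
uint32_t shiftPair(uint32_t X, unsigned C1, unsigned C2) {
  return (X >> C1) << C2;
}

// Shift+mask form: one shift by the difference of the amounts, then an
// AND that clears the low C2 bits.
uint32_t shiftMask(uint32_t X, unsigned C1, unsigned C2) {
  const uint32_t Mask = ~uint32_t{0} << C2;
  if (C1 >= C2)
    return (X >> (C1 - C2)) & Mask;
  return (X << (C2 - C1)) & Mask;
}

// Illustrative check that both forms agree for a few shift amounts.
void selfCheck(uint32_t X) {
  assert(shiftPair(X, 5, 3) == shiftMask(X, 5, 3));
  assert(shiftPair(X, 3, 5) == shiftMask(X, 3, 5));
  assert(shiftPair(X, 4, 4) == shiftMask(X, 4, 4));
}
```

Both functions compute the same value; which one is preferable depends on whether the target executes shifts as cheaply as bit ops, which is the property this patch keys on.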