As suggested by Fabian on PR37441, use PSHUFLW to extend shift amount types for use with PSRAD/PSRLD to reduce register pressure.
Some of this ideally would be done by combineTargetShuffle but its tricky to do as most of the shuffles are sharing inputs.
Are these bitcasts here because you could fold them into this line? They are only needed on the non-AVX path from above right?