combineVectorTruncationWithPACKUS currently splits the upper-bit masking into 128-bit subregs and then concatenates them back together.
This was originally done to avoid regressions in which existing subregs were concatenated to the larger type just for the AND masking before being extracted again. Those regressions have since been fixed by @spatel (notably rL303997 and rL347356).
This also lets SimplifyDemandedBits do some further improvements before it hits the recursive depth limit.
My only annoyance with this is that we were broadcasting some xmm masks, and we seem to have lost those broadcasts by moving to ymm. That's a known issue, though, as the logic in lowerBuildVectorAsBroadcast isn't great.
It would be good to add an example in the comment here or within the function. Something like:
trunc <8 x i32> X to <8 x i16> -->
MaskX = X & 0xffff (clear high bits to prevent saturation)
packus (extract_subv MaskX, 0), (extract_subv MaskX, 1)
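
For anyone wanting to sanity-check why the AND is needed, here is a rough scalar model of the pattern above. It is only a sketch and not the DAG combine itself: packusLane, packusHalves and truncViaMaskedPackus are hypothetical helper names, and the per-lane behaviour is assumed to match PACKUSDW (signed i32 input, unsigned saturation to 16 bits).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V8I32 = std::array<uint32_t, 8>;
using V8I16 = std::array<uint16_t, 8>;

// Scalar model of one PACKUSDW lane: the input is treated as a signed i32
// and saturated to the unsigned 16-bit range [0, 0xFFFF].
uint16_t packusLane(int32_t V) {
  if (V < 0)
    return 0;
  if (V > 0xFFFF)
    return 0xFFFF;
  return static_cast<uint16_t>(V);
}

// Model of: packus (extract_subv X, 0), (extract_subv X, 1)
// Lanes 0-3 come from the low 128-bit half, lanes 4-7 from the high half,
// so element order is preserved.
V8I16 packusHalves(const V8I32 &X) {
  V8I16 R{};
  for (unsigned I = 0; I != 8; ++I)
    R[I] = packusLane(static_cast<int32_t>(X[I])); // two's complement assumed
  return R;
}

// Model of: trunc <8 x i32> X to <8 x i16>
//   MaskX = X & 0xffff  (clear high bits so packus never saturates)
//   packus (extract_subv MaskX, 0), (extract_subv MaskX, 1)
V8I16 truncViaMaskedPackus(const V8I32 &X) {
  V8I32 MaskX{};
  for (unsigned I = 0; I != 8; ++I)
    MaskX[I] = X[I] & 0xFFFFu;
  return packusHalves(MaskX);
}
```

With the mask in place every lane is already in [0, 0xFFFF], so the saturating pack behaves as a plain truncation. Without it, a lane such as 0x00012345 would saturate to 0xFFFF instead of truncating to 0x2345.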