This is a fix for PR39974:
https://bugs.llvm.org/show_bug.cgi?id=39974
I didn't see any existing TLI hooks that capture what we need to know to decide whether this is profitable, so I'm proposing a new hook that takes the source and destination types of the cast op. It is enabled for x86 only here, but any target that wants to avoid a register file back-and-forth may find it useful.
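To illustrate the shape of such a hook, here is a minimal standalone sketch. It does not use LLVM headers, and every name in it (`ToyVT`, `CastKind`, `ToyTargetHooks`, `isExtractCastRewriteProfitable`, `ToyOptInTarget`) is hypothetical and is not the name or signature used in the patch. The opt-in policy shown (say yes only when the cast crosses between integer and FP types) is an assumption chosen for illustration, not a description of what x86 does here.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; these are not LLVM types.
enum class CastKind { SIntToFP, UIntToFP, FPToSInt, FPToUInt, FPExtend };

struct ToyVT {
  bool isFloat;        // true for FP element types, false for integer
  unsigned scalarBits; // width of one element in bits
  unsigned numElts;    // number of vector elements
};

struct ToyTargetHooks {
  virtual ~ToyTargetHooks() = default;

  // Conservative default: targets that do not override this never ask
  // for the combine, so their codegen is unchanged.
  virtual bool isExtractCastRewriteProfitable(CastKind, ToyVT /*srcTy*/,
                                              ToyVT /*dstTy*/) const {
    return false;
  }
};

// A target that opts in. Assumed policy: the rewrite is only worthwhile
// when the cast moves between integer and FP types, since that is where
// a round trip between register files could occur.
struct ToyOptInTarget : ToyTargetHooks {
  bool isExtractCastRewriteProfitable(CastKind, ToyVT srcTy,
                                      ToyVT dstTy) const override {
    assert(srcTy.numElts == dstTy.numElts && "cast must preserve lane count");
    return srcTy.isFloat != dstTy.isFloat;
  }
};
```

A combine would query the hook with both types before rewriting, so a target can accept some type pairs (for example v4i32 to v4f32) and reject others; passing only one of the two types would not let it make that distinction.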
The known bits diffs suggest that we can do better at simplifying based on vector demanded elements, but I'm assuming those are not the typical patterns.
We would also likely improve things by moving shuffles ahead of the cast in the case where we are not extracting from element 0.
Looks bad