This is the reverse of the IR canonicalization proposed in D50992; the transform here is insert+vector op --> scalar op+insert.
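As a rough illustration of why the two forms are interchangeable, here is a minimal C++ model of the transform on a 4 x i32 vector with 'add'. It is a sketch, not the patch itself: the names V4i32, insertLane0, vectorOpOfInserts, and insertOfScalarOp are hypothetical, and the upper lanes are zero-filled only to keep the model deterministic (this is an assumption; the real lowering may leave those lanes undefined).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 4 x i32 vector register.
using V4i32 = std::array<std::uint32_t, 4>;

// Model of inserting a scalar into lane 0. The upper lanes are zeroed here
// purely as a modelling assumption.
V4i32 insertLane0(std::uint32_t X) { return V4i32{X, 0, 0, 0}; }

// Before: insert each scalar into a vector, then run the op on every lane.
V4i32 vectorOpOfInserts(std::uint32_t X, std::uint32_t Y) {
  V4i32 A = insertLane0(X);
  V4i32 B = insertLane0(Y);
  V4i32 R{};
  for (std::size_t I = 0; I != R.size(); ++I)
    R[I] = A[I] + B[I]; // unsigned add wraps, like an i32 'add'
  return R;
}

// After: do the op once on the scalars, then do a single insert.
V4i32 insertOfScalarOp(std::uint32_t X, std::uint32_t Y) {
  return insertLane0(X + Y);
}
```

In this model the second form needs one GPR-to-vector transfer instead of two, which is where the cost trade-off discussed below comes from.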
I've enabled x86 in a minimal way because this looks like a close call for most recent Intel CPUs. They have a fast (1 uop / 1 cycle latency) transfer from GPR to *MM according to Agner / llvm-mca, and the current AVX2 code is likely too broadcast-happy, as seen in the test diffs.
So this doesn't exercise the recent broadcast improvements from D51125 / D51186 yet.
Enabling more opcodes/types (the test file covers all 18 IR-equivalent binops) looks tricky. For example, we should have gotten the 'and' with i32 test, but v4i32 is promoted to v2i64. And pre-SSE4, we don't have pmulld, so we can't uniformly allow isLegalOrCustom().