Related to https://bugs.llvm.org/show_bug.cgi?id=40123.
Rather than scalarizing, expand a vector USUBSAT into UMAX+SUB, i.e. usubsat(a, b) == umax(a, b) - b, which produces much better code for X86. I've updated the cost tables on the assumption that pmaxud has a cost of 1, though I'm not sure that's correct: Agner lists it as latency 1, with a reciprocal throughput of 0.5 or 1 depending on uarch.
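
For reference, a minimal per-lane sketch of why the expansion is valid. This is plain scalar C++ written for illustration, not the legalizer code from this patch, and the names usubsat_ref, usubsat_umax and check_equivalence are hypothetical. If a >= b, umax(a, b) - b is a - b; otherwise it is b - b, which is 0, matching the saturated result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Reference semantics for one lane: unsigned subtract that clamps at zero.
inline uint32_t usubsat_ref(uint32_t a, uint32_t b) {
    return a > b ? a - b : 0u;
}

// Expanded form for one lane: umax(a, b) - b.
// The max guarantees the minuend is >= b, so the subtract never wraps.
inline uint32_t usubsat_umax(uint32_t a, uint32_t b) {
    return std::max(a, b) - b;
}

// Compare both forms on a handful of boundary values (illustrative only,
// not an exhaustive proof).
inline void check_equivalence() {
    const uint32_t samples[] = {0u, 1u, 2u, 0x7FFFFFFFu,
                                0x80000000u, 0xFFFFFFFEu, 0xFFFFFFFFu};
    for (uint32_t a : samples)
        for (uint32_t b : samples)
            assert(usubsat_ref(a, b) == usubsat_umax(a, b));
}
```

On a vector type each lane follows the same two steps, so the whole operation becomes one unsigned max followed by one subtract instead of a per-element sequence.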