The current implementation tries to handle the high and low halves separately, but that's less efficient in most cases; use a wide SETCC instead.
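For illustration, here is a minimal plain C++ sketch of the two shapes being compared. It assumes an unsigned less-than on a 64-bit value split into 32-bit halves; the description above does not say which predicate or widths are involved. The helper names `ultSplit`, `ultWide`, and `checkAgree` are hypothetical and are not taken from the patch or from LLVM.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: compare the high and low halves separately.
// Needs two compares on the high half, one on the low half, plus combining logic.
bool ultSplit(uint32_t aLo, uint32_t aHi, uint32_t bLo, uint32_t bHi) {
    return aHi < bHi || (aHi == bHi && aLo < bLo);
}

// Hypothetical: rebuild the wide values and do one wide compare,
// leaving it to later lowering to pick the best sequence for the target.
bool ultWide(uint32_t aLo, uint32_t aHi, uint32_t bLo, uint32_t bHi) {
    uint64_t a = (static_cast<uint64_t>(aHi) << 32) | aLo;
    uint64_t b = (static_cast<uint64_t>(bHi) << 32) | bLo;
    return a < b;
}

// Sanity check that both forms agree on a given input.
void checkAgree(uint32_t aLo, uint32_t aHi, uint32_t bLo, uint32_t bHi) {
    assert(ultSplit(aLo, aHi, bLo, bHi) == ultWide(aLo, aHi, bLo, bHi));
}
```

Both forms compute the same result. Which one is cheaper depends on how the target lowers a wide compare, which is presumably why results can differ from one backend to another.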
There are still some small regressions scattered across the testcases. The most concerning are AMDGPU and VE, where the wide SETCC apparently produces worse code in the general case. Do we need a target hook, or is there some way to use a unified codepath?
This should do the RHS check first.