"CVTTPS2SI" returns 0x80000000 for out-of-range inputs. We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend, using the following logic:
small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))
Even on targets where "BLENDVPS"/"PBLENDVB" exists, it is often a latency-2, low-throughput instruction, so this logic is applied there too (including on AVX2). It also removes one high-latency floating-point comparison that the previous lowering needed.
I checked the correctness of this exhaustively for every float between -1 and 2^32 (both ends excluded).
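As a rough illustration of the logic and of the kind of exhaustive check described above, here is a scalar C model of a single lane. It is a sketch, not the code in this patch: the helper names (cvttps2si_lane, fp_to_ui_lane, count_mismatches) are hypothetical, and CVTTPS2SI is emulated in portable C rather than executed, on the assumption that it truncates toward zero and yields 0x80000000 for NaN or out-of-range inputs.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of one CVTTPS2SI lane (assumed semantics): truncate toward
   zero; NaN or anything outside [-2^31, 2^31) gives 0x80000000. */
static int32_t cvttps2si_lane(float x) {
    if (!(x >= -2147483648.0f && x < 2147483648.0f))
        return INT32_MIN;
    return (int32_t)x;
}

/* fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31)) */
static uint32_t fp_to_ui_lane(float x) {
    int32_t small = cvttps2si_lane(x);
    int32_t big = cvttps2si_lane(x - 2147483648.0f);
    /* Stands in for the arithmetic shift by 31: all ones iff small < 0.
       Written as a select because >> on negative ints is
       implementation-defined in C. */
    uint32_t mask = small < 0 ? 0xFFFFFFFFu : 0u;
    return (uint32_t)small | ((uint32_t)big & mask);
}

/* Walk every float whose bit pattern lies in [first_bits, last_bits] and
   count lanes that disagree with a plain truncating conversion. Only valid
   for ranges inside (-1, 2^32), where the reference fits in 32 bits. */
static uint64_t count_mismatches(uint32_t first_bits, uint32_t last_bits) {
    uint64_t bad = 0;
    for (uint64_t b = first_bits; b <= last_bits; ++b) {
        uint32_t bits = (uint32_t)b;
        float x;
        memcpy(&x, &bits, sizeof x);
        uint32_t expected = (uint32_t)(int64_t)x;
        if (fp_to_ui_lane(x) != expected)
            ++bad;
    }
    return bad;
}
```

With this model, the bit patterns 0x00000000..0x4F7FFFFF cover [0, 2^32) and 0x80000000..0xBF7FFFFF cover (-1, -0], so calling count_mismatches on those two ranges mirrors the check above. When small is non-negative the mask is zero, so the second conversion only contributes for inputs at or above 2^31, where x - 2^31 is exact in float.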
I have adjusted some cost model values for this, but I am not sure I have done that correctly. The existing costs don't look very consistent to me. For example, a conversion from 8 floats to 8 uint8/int8 gives me a cost of 7, although fewer instructions are generated and, latency-wise, the dependency chain is much shorter. v4i32 to v4f64 is even more extreme, with a cost of 16 although the generated code is nice and simple. I have set the new cost for the conversion in this patch to 8, based on what was set previously and on the latency of the longest dependency chain (the explanation at the top of the file says the cost should reflect that). Additionally, type legalization isn't done before the cost tables are looked up, so I had to add multiple entries for different vector widths, which seems redundant.