I don't think there's a lot of room for variance in the implementation, but I'd like to have opinions on one particular point: the scalar expansion.
The expanded code for the intrinsics is the equivalent 2icmp+2select, roughly corresponding to:
(x > max ? max : (x < min ? min : x))
that is:
%0 = icmp slt i32 %x, %min %1 = select i1 %0, i32 %min, i32 %x %2 = icmp sgt i32 %1, %max %3 = select i1 %2, i32 %max, i32 %1
With, for signed n-bit saturation:
min: -(2^(n-1)) max: 2^(n-1)-1
However, an alternative expansion could be to not reuse the intermediate result, i.e.:
%0 = icmp slt i32 %x, %min %1 = select i1 %0, i32 %min, i32 %x %2 = icmp sgt i32 %x, %max ; <--- %x instead of %1 %3 = select i1 %2, i32 %max, i32 %1
I'm not entirely convinced which is better. In practice, the latter leads to increased register pressure, and actually resulted in some regressions, just for going through the intrinsic and back to the expansion.
The current patch uses the first one (with the dependency).
Anyway, this isn't a blocking problem for this patch, since the intrinsics aren't generated. We'll see about introducing more machinery (say a target hook to only generate the intrinsics when profitable, a register-pressure-aware legalization, or a later relaxation, etc..) if it's deemed necessary.
Based on D6976.