This expands upon D75069, allowing them to be inserted into tail folded loops. Reductions are generates with the form:
x = select(mask, vecop, zero) v = vecreduce.add(x)
Where zero here is chosen as the identity value for add reductions. The backend is then expected to fold the select and the vecreduce into a single predicated instruction.
Most of the code is fairly straight forward, except for the creation of blockmasks which need to ensure they are created in dominance order.
nit: perhaps a comment about the in-loop reductions.