Our actual lowering for i1 reductions uses ctpop combined with possibly a vector negate and possibly a logic op afterwards. I believe ctpop to be low cost on all reasonable hardware.
The default costing implementation here was returning quite inconsistent costs. and/or were returning very high costs (because we seem to think moving into scalar registers is very expensive?) and others were returning lower but still too high (because of the assumed tree reduce strategy). While we should probably improve the generic costing strategy for i1 vectors, let's start by fixing the immediate problem.
I really hope that max/min of i1 don't occur in normal scenarios.