The normal scheme for tail folding reductions is to use:
loop: p = phi(0, a) mask = ... x = masked_load(..., mask) a = add(x, p) s = select(mask, a, p)
This means we need to keep the register p and a alive out of the loop, plus the mask. On a target with predicated operations we can instead generate the phi as p = phi(0, s). This ensures the select in the loop and we can fold select(m, add(a, b), c) to something like a vaddt c, a, b using the m predicate. This in turn allows us to tail predicate the entire loop.