This patch adds support for vectorizing conditionally executed div/rem operations via a variant of widening. The existing support for predicated div/rem in the vectorizer requires scalarization, which we can't do for scalable vectors.
The basic idea is that we can always divide (or take the remainder) by 1 without executing UB. As such, we can use the active lane mask to conditionally select either the actual divisor for active lanes, or a constant 1 for inactive lanes. We already account for the cost of the active lane mask, so the only additional cost is a splat of 1 and the vector select.
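As a rough sketch (with illustrative value names and VF, not the exact recipe output), the widened form looks something like:

```llvm
; Splat of 1 used as the divisor for inactive lanes.
%one.head = insertelement <vscale x 4 x i32> poison, i32 1, i64 0
%one = shufflevector <vscale x 4 x i32> %one.head, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
; Inactive lanes divide by 1, so the widened udiv can't trap.
%safe.divisor = select <vscale x 4 x i1> %active.lane.mask, <vscale x 4 x i32> %divisor, <vscale x 4 x i32> %one
%div = udiv <vscale x 4 x i32> %dividend, %safe.divisor
```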
There are some potential further improvements code-wise, but I don't think any of them are worth bothering with. The difference between vectorizing and not vectorizing is huge; the remaining wins are, well, minor. Just for the record, here are the things I currently know of:
- For the block predication, we're generating a select after the div/rem. This is arguably redundant with the select before the div/rem, and we could try to fold the later one out. (See the sketch after this list.)
- For targets with masked vector div/rem operations, we could make sure to pattern match the select. (Div/Rem by 1 is the same as using the undisturbed LHS.)
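To illustrate the first point, the block-predicated form ends up roughly as follows (illustrative names again, reusing the splat of 1 from above; the merge value is shown as the dividend purely for the sake of the example):

```llvm
; Safe-divide guard: inactive lanes divide by 1 and return the dividend.
%safe.divisor = select <vscale x 4 x i1> %mask, <vscale x 4 x i32> %divisor, <vscale x 4 x i32> %one
%div = udiv <vscale x 4 x i32> %dividend, %safe.divisor
; Merge select emitted by block predication; this is the later select that
; could arguably be folded away.
%merged = select <vscale x 4 x i1> %mask, <vscale x 4 x i32> %div, <vscale x 4 x i32> %dividend
```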
This is an area of the code I'm fairly new to, and I'm not quite clear on the desired design re: VPlan. Given that, I'm going to take extra care to lay out the major design decisions so that reviewers can tell me where I got this wrong and how to rework it. :)
Strategy/lowering-wise, we've got a couple of choices:
- This approach which uses the active lane mask with the select.
- Forming an independent safety mask by explicitly checking for the UB-triggering edge cases (see the sketch after this list). This works, but seems to generate strictly worse code for the generic case. (This is actually where I started.)
- Lowering to a VP intrinsic. I'd prefer not to couple this to the VP work, and the select lowering is profitable on its own, even for targets which don't have VL or mask predication on the udiv itself. Given that, we'd probably want a cost-model-based mechanism to expand the VP op back out anyways, and I'd prefer to think about that whole problem separately.
- Lowering to a loop with internal predication. We could replace replication with a recipe which generates a sub-loop. The sub-loop would extract each lane, optionally do the divide, and insert the result back. Doing this would avoid potential concerns about speculation profitability, but appears to be a more invasive change to the vectorizer. I'm unconvinced this is worthwhile for this case.
- We could add a udiv variant which does not fault. Most (all?) real vector hardware does not fault, so we could have a variant which directly represented this. This is a tricky design space, and is a larger topic than I really want to get into now.
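For reference, the safety-mask alternative would look something like the following for udiv (illustrative names, reusing the splat of 1 from earlier; sdiv/srem would additionally need to guard the INT_MIN / -1 overflow case). The extra icmp/and is where the worse generic-case codegen comes from:

```llvm
; Explicitly screen out the UB-triggering lanes (divisor of zero), then
; combine with the active lane mask.
%nonzero = icmp ne <vscale x 4 x i32> %divisor, zeroinitializer
%safety.mask = and <vscale x 4 x i1> %active.lane.mask, %nonzero
%safe.divisor = select <vscale x 4 x i1> %safety.mask, <vscale x 4 x i32> %divisor, <vscale x 4 x i32> %one
%div = udiv <vscale x 4 x i32> %dividend, %safe.divisor
```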
Code-structure-wise, I see three major options:
- Use a new recipe specific to div/rem.
- Extend the existing VPWidenRecipe to handle the safe-divide guard if a mask is provided. DivRem is the only opcode which has this safety guard, so while doing this was straightforward, it seemed oddly coupled.
- Use a new recipe for arbitrary predicated neutral element guards. This could be done for e.g. all binary ops, and is arguably a useful building block towards the VP patches. This feels a bit speculative to me, and possibly falsely generic.
I really don't have any preference as to the recipe structure, and will defer to reviewers with more context on the overall VPlan design and direction. Let me know what you want, and I'll adjust.
I enabled the new code only for scalable vectors. We could legally enable it for fixed vectors as well, but I haven't thought through the cost tradeoffs between widening and scalarization enough to know whether that's profitable. I'd welcome input on how this should be structured, but am also hoping to defer actually implementing it to a follow-up patch. :)
Do you know if there's an easy way to assert this here? Seems like something that could be easily missed.