This is an archive of the discontinued LLVM Phabricator instance.

[LV] Support predicated div/rem operations via safe-divisor select idiom
ClosedPublic

Authored by reames on Jul 20 2022, 6:10 AM.

Details

Summary

This patch adds support for vectorizing conditionally executed div/rem operations via a variant of widening. The existing support for predicated divrem in the vectorizer requires scalarization which we can't do for scalable vectors.

The basic idea is that we can always divide (or take the remainder) by 1 without executing UB. As such, we can use the active lane mask to conditionally select either the actual divisor for active lanes, or a constant 1 for inactive lanes. We already account for the cost of the active lane mask, so the only additional cost is a splat of 1 and the vector select.
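As a rough scalar model of what each vector lane computes under this idiom (the function and parameter names here are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scalar model of one vector lane under the safe-divisor
 * idiom: inactive lanes divide by 1 (which can never trap), and the
 * resulting dummy quotient is discarded by the predication select. */
static int32_t lane_safe_sdiv(bool active, int32_t x, int32_t d,
                              int32_t passthru) {
  int32_t safe_d = active ? d : 1; /* select actual divisor or constant 1 */
  int32_t q = x / safe_d;          /* inactive lanes compute x / 1, no UB */
  return active ? q : passthru;    /* block-predication select */
}
```

Only the first select is new cost; the second models the block-predication select the vectorizer already emits.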

There are some potential further improvements code-wise, but I don't think any of them are worth bothering with. The difference between vectorizing and not vectorizing is huge; the remaining wins are, well, minor. Just for the record, here are the ones I currently know of:

  • For the block predication, we're generating a select after the div/rem. This is arguably redundant with the select before the div/rem, and we could try to fold the later one out.
  • For targets with masked vector div/rem operations, we could make sure to pattern match the select. (Div/Rem by 1 is the same as using the undisturbed LHS.)

This is an area of the code I'm fairly new to, and I'm not quite clear on the desired design re: VPlan. Given that, I'm going to take extra care to lay out the major design decisions so that reviewers can tell me I did this wrong and how to rework it. :)

Strategy/lowering-wise, we've got a couple of choices:

  • This approach which uses the active lane mask with the select.
  • Forming an independent safety mask by explicitly checking for UB triggering edge cases. This works, but seems to generate strictly worse code for the generic case. (This is actually where I started.)
  • Lowering to a VP intrinsic. I'd prefer not to couple this to the VP work, and the select lowering is profitable on its own, even for targets which don't have VL or mask predication on the udiv itself. Given that, we'd probably want a cost-model-based mechanism to expand the VP op back out anyways, and I'd prefer to think about that whole problem separately.
  • Lowering to a loop with internal predication. We could replace replication with a recipe which generates a sub-loop. The sub-loop would extract each lane, optionally do the divide, and insert it back. Doing this would avoid potential concerns about speculation profitability, but appears to be a more invasive change to the vectorizer. I'm unconvinced this is worthwhile for this case.
  • We could add a udiv variant which does not fault. Most (all?) real vector hardware does not fault, so we could have a variant which directly represented this. This is a tricky design space, and is a larger topic than I really want to get into now.
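For contrast with the chosen approach, the second strategy above (an explicit safety mask) can be sketched per lane as follows; the names are mine, and the two guarded conditions are exactly the UB cases of IR sdiv (divide by zero, and the INT_MIN / -1 overflow):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-lane model of the "independent safety mask" strategy:
 * explicitly test for the UB-triggering edge cases of sdiv and only
 * divide when provably safe. The extra compares are what make the
 * generic codegen worse than the safe-divisor select. */
static int32_t lane_guarded_sdiv(bool active, int32_t x, int32_t d,
                                 int32_t passthru) {
  bool would_trap = (d == 0) || (x == INT32_MIN && d == -1);
  bool safe = active && !would_trap;
  int32_t q = safe ? x / d : 0; /* active lanes that would trap had UB
                                 * in the original program anyway */
  return active ? q : passthru;
}
```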

Code structure wise, I see three major options:

  • Use a new recipe specific to div/rem.
  • Extend the existing VPWidenRecipe to handle the safe-divide guard if a mask is provided. Div/rem is the only opcode which has this safety guard, so while doing this was straightforward, it seemed oddly coupled.
  • Use a new recipe for arbitrary predicated neutral element guards. This could be done for e.g. all binary ops, and is arguably a useful building block towards the VP patches. This feels a bit speculative to me, and possibly falsely generic.

I really don't have any preference as to the recipe structure, and will defer to reviewers with more context on the overall VPlan design and direction. Let me know what you want, and I'll adjust.

I enabled the new code only for scalable vectors. We could also legally enable it for fixed vectors as well, but I haven't thought through the cost tradeoffs between widening and scalarization enough to know if that's profitable. I'd welcome input on how this should be structured, but am also hoping to defer actually implementing it to a follow up patch. :)

Diff Detail

Event Timeline

reames created this revision.Jul 20 2022, 6:10 AM
reames requested review of this revision.Jul 20 2022, 6:10 AM
reames edited the summary of this revision. (Show Details)Jul 20 2022, 8:22 AM

Div/Rem by 1 is the same as using the undisturbed LHS.

That's true for Div, but Rem by 1 would produce 0.
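A quick scalar check confirms the distinction being made here:

```c
#include <assert.h>
#include <stdint.h>

/* Sanity check for the quoted claim: dividing by 1 leaves the LHS
 * undisturbed, but taking the remainder by 1 always yields 0. */
static int32_t div_by_one(int32_t x) { return x / 1; }
static int32_t rem_by_one(int32_t x) { return x % 1; }
```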

What does SelectionDAG do with a scalable vector division if it isn't legal for the target?

What does SelectionDAG do with a scalable vector division if it isn't legal for the target?

Not sure, but this patch shouldn't result in that being created. Such a target is responsible for setting a reasonable cost. If that cost is invalid, existing logic should bail out of vectorization for scalable vector factors.

Div/Rem by 1 is the same as using the undisturbed LHS.

That's true for Div, but Rem by 1 would produce 0.

True, didn't think carefully enough. If we end up trying to implement the undisturbed case, we can explore using an alternate value for the safe-divisor for rem. Not thinking about it real hard, int_max/uint_max might be reasonable.

On targets with a vector "div" instruction, the instruction never actually traps or otherwise misbehaves; the only reason we have an issue here is that the IR "sdiv" is defined to be instant UB. Maybe we could consider introducing a new IR operation to model this? Maybe it doesn't really matter, though; division is expensive enough that the cost of an extra "select" is probably insignificant.

Not thinking about it real hard, int_max/uint_max might be reasonable.

That doesn't work for srem; the absolute value of int_min is larger than int_max.
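Concretely, x % INT_MAX only leaves x undisturbed when |x| < INT_MAX, and since |INT_MIN| = INT_MAX + 1, the inputs INT_MIN and INT_MIN + 1 break the trick (the helper name here is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar check of the objection above: INT32_MAX is not a valid
 * "undisturbed" safe divisor for srem because of the two most
 * negative inputs. */
static int32_t rem_by_int_max(int32_t x) { return x % INT32_MAX; }
```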

On targets with a vector "div" instruction, the instruction never actually traps or otherwise misbehaves; the only reason we have an issue here is that the IR "sdiv" is defined to be instant UB. Maybe we could consider introducing a new IR operation to model this? Maybe it doesn't really matter, though; division is expensive enough that the cost of an extra "select" is probably insignificant.

I think we run the very real danger of letting perfection be the enemy of the good here. Once we can vectorize this at all, we can come back and explore better lowerings.

Sure, I didn't mean we should block this patch until we come up with the ideal solution. Just thought it was worth bringing up since you didn't mention it as a strategy.

reames edited the summary of this revision. (Show Details)Jul 20 2022, 11:50 AM

Sure, I didn't mean we should block this patch until we come up with the ideal solution. Just thought it was worth bringing up since you didn't mention it as a strategy.

Fair point, added it to the strategy list above.

A random thought on this, just for possible future reference: "does not fault" does not always mean "is always fast". I'm not sure of the details here, but I've run into cases like this in the past where the non-faulting behavior performed so poorly you had to avoid it anyway. This would be something we'd need to check for when specifying the udiv variant.

Matt added a subscriber: Matt.Jul 27 2022, 1:40 PM
fhahn added inline comments.Jul 29 2022, 7:46 AM
llvm/lib/Transforms/Vectorize/VPlan.h
948 ↗(On Diff #446129)

Is a new recipe needed here? Would it be possible to generate a VPWidenSelectRecipe/ VPInstruction with an Instruction::Select opcode feeding the VPWidenRecipe for the div/rem instead?

reames updated this revision to Diff 448681.Jul 29 2022, 10:55 AM

Rework using VPInstruction as suggested by @fhahn.

I'm a bit unsure of the wiring for the VPInstruction. It appears to work, but I'm basically just copying code I don't understand. Careful review appreciated.

xbolva00 added inline comments.
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
697–698

Invalid?

reames added inline comments.Aug 15 2022, 9:42 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
697–698

The comment is now stale and needs to be removed. Is that what you meant by "Invalid"?

xbolva00 added inline comments.Aug 15 2022, 9:57 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
697–698

Yup, please update it

reames updated this revision to Diff 452750.Aug 15 2022, 11:20 AM

Address review comment and rebase

fhahn added inline comments.Aug 16 2022, 1:07 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8371

Do you know if there's an easy way to assert this here? Seems like something that could be easily missed.

reames added inline comments.Aug 16 2022, 6:57 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8371

I don't, and this is a pretty prevalent assumption in the code already, e.g. all predicated store/load handling does the same.

If anything, I might lean towards removing the comment as it may falsely give the impression the assumption is unique to this code.

fhahn accepted this revision.Aug 23 2022, 12:43 PM
fhahn added reviewers: Ayal, gilr.

LGTM, thanks! I am also adding Ayal and Gil, in case they have additional comments, so it would be good to wait a day before committing so they have a chance to chime in.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8371

Fair enough!

This revision is now accepted and ready to land.Aug 23 2022, 12:43 PM
This revision was landed with ongoing or failed builds.Aug 24 2022, 10:08 AM
This revision was automatically updated to reflect the committed changes.