Page MenuHomePhabricator

[Loop Vectorizer] Support predication of div/rem

Authored by gilr on Jul 28 2016, 8:43 AM.



div/rem instructions in basic blocks that require predication currently prevent vectorization. This patch extends the existing mechanism for predicating stores to handle other instructions and leverages it to predicate divs and rems.

The generated vector extracts and inserts are now moved into the predicated block (reflected in the cost model for scalarization).

Diff Detail


Event Timeline

gilr updated this revision to Diff 65934.Jul 28 2016, 8:43 AM
gilr retitled this revision from to [Loop Vectorizer] Support predication of div/rem.
gilr updated this object.
gilr added reviewers: mkuper, jmolloy.
gilr added subscribers: llvm-commits, Ayal, delena.
anemet added a subscriber: anemet.Jul 28 2016, 9:58 AM
anemet added inline comments.
3825–3834 ↗(On Diff #65934)

Only a drive-by comment, please don't make vectorizeLoop any more unreadable than it already is. Please consider prequel to this patch that moves store-predication into its own function.

You might want to consider special-casing division by a constant integer. For example, on x86, we can convert a 16-bit unsigned divide by a constant into a pmulhuw+psrlw.

mkuper added inline comments.Jul 28 2016, 3:50 PM
3327 ↗(On Diff #65934)

I'm not entirely sure what "the cost of a phi" means, especially without a type.

Also, I don't believe any in-tree target actually assigns a non-zero to PHIs right now. So if you actually want a meaningful cost here for, say, x86, you may want to look into the cost model as well...

3825 ↗(On Diff #65934)

...that need predication?
(Before, we really would predicate *any* store)

3825–3834 ↗(On Diff #65934)


3833 ↗(On Diff #65934)

We don't run anything that will sink this later in the pipeline?
(To be honest, even if we do, I'm not sure whether we should do the cleanup here or rely on a later pass, but I'm curious)

3835 ↗(On Diff #65934)

Any reason not to use a range for over the operands?

3841 ↗(On Diff #65934)

Do you know if we have an isOnlyUserOf helper?
I know we have one for SDNode... but, I couldn't find one in IR.

3859 ↗(On Diff #65934)

I->hasOneUse()? Or do you care specifically about the users() list? In any case, no need to compute std::distance.

3860 ↗(On Diff #65934)


4251 ↗(On Diff #65934)

FRem looks out of place. Did you mean URem?

4254 ↗(On Diff #65934)

To expand on what Eli said, if we have division by a non-zero constant, then:

  1. It may be efficiently lowered.
  2. Since the constant is non-zero, it doesn't need predication.
4264 ↗(On Diff #65934)

And, correspondingly, this should probably be FRem.

5070 ↗(On Diff #65934)

As long as you're touching this - remove this comment?

gilr added inline comments.Jul 31 2016, 8:19 AM
3327 ↗(On Diff #65934)

So IIRC I took it from CostModelAnalysis::getInstructionCost(), but I'll remove it if it makes no sense.

3833 ↗(On Diff #65934)

Actually the vectorizer currently seems to expect inst-combine to do so (see the deleted VEC-IC case in if-pred-stores.ll), but IINM the cost model didn't reflect that.
I agree, there's the general issue of generating efficient code here vs relying on later passes to clean up, which I think this was also brought up here.

3835 ↗(On Diff #65934)

No, will fix.

3841 ↗(On Diff #65934)

Me either - anyone?

3859 ↗(On Diff #65934)

Right, will fix.

3860 ↗(On Diff #65934)

Right, will fix.

4251 ↗(On Diff #65934)


4264 ↗(On Diff #65934)


mkuper added inline comments.Aug 1 2016, 2:54 PM
3327 ↗(On Diff #65934)

I'm not sure what makes sense here, to be honest.
What cost, in the final generated code, are you trying to account for? An extra register copy?

4251 ↗(On Diff #65934)

Too bad there wasn't a test that would have failed because of this. ;-)

gilr updated this revision to Diff 66474.Aug 2 2016, 7:57 AM

Implemented several reviewer comments, notably avoiding predication when dividing by a non-zero constant.

gilr updated this revision to Diff 66483.Aug 2 2016, 8:54 AM

Added may-divide-by-zero logic to cost model (was missing in previous patch); moved logic to its own helper function.

More drive-by comments.

3835–3844 ↗(On Diff #66483)

Would be good to describe the high-level strategy of how we predicate these instruction, perhaps with an IR example.

3860–3861 ↗(On Diff #66483)

I think that Twine knows how to concatenate string-like things. You only need the explicit ctor on the first one.

18–51 ↗(On Diff #66483)

I *think* we're pretty consistent about using uppercase for the named regexes. That helps readability.

gilr added inline comments.Aug 4 2016, 3:56 AM
3331 ↗(On Diff #66483)

I actually wasn't trying to model any specific cost in the generated code, just trying to be consistent about accounting for every generated instruction at IR level (and letting TTI decide their cost).
So if PHIs have zero cost by definition/convention and should not be taken into account in cost models then I should just remove this. Otherwise, placing the call now makes sure we don't miss that cost if targets start modelling it. What say you?

3860–3861 ↗(On Diff #66483)

Right, will fix.

mkuper added inline comments.Aug 4 2016, 9:28 AM
3331 ↗(On Diff #66483)

We're not even consistent about it in our different cost models - CostModelAnalysis::getInstructionCost() calls getCFInstrCost(), while (the admittedly, old) BBVectorizer cost model just reutrns 0.
But I get what you're saying. If you prefer to leave it, leave it, but I think it'd be nice to document the fact you don't currently expect a real cost here.

gilr updated this revision to Diff 66949.Aug 5 2016, 8:06 AM
  • Merged with prequel patch r277595 (D23013).
  • Changed named registers in new lit test to uppercase.
  • Documented predication logic
  • Removed unnecessary Twine ctor calls.
mkuper accepted this revision.Aug 5 2016, 10:14 AM
mkuper edited edge metadata.

LGTM, but please wait a bit for anemet, in case he also wants to review this in non-drive-by-mode. :-)

This revision is now accepted and ready to land.Aug 5 2016, 10:14 AM
anemet added inline comments.Aug 9 2016, 12:01 AM
3385–3386 ↗(On Diff #66949)

There is something wrong with this sentence.

3388 ↗(On Diff #66949)


Same in the other functions.

3416–3417 ↗(On Diff #66949)

Explain in the comment how this guys is different from the previous one.

3431–3432 ↗(On Diff #66949)

Same here.

3433 ↗(On Diff #66949)

Type* -> Type *

Did you run clang-format on the diff?

4105 ↗(On Diff #66949)

auto *

4108–4115 ↗(On Diff #66949)

Is this any different than OpInst->hasOneUse?

4135–4138 ↗(On Diff #66949)

Is there a test for the non-insertelt case?

4138 ↗(On Diff #66949)

Why is the undef correct here?

4347 ↗(On Diff #66949)

auto *

gilr added inline comments.Aug 10 2016, 3:53 AM
4108–4115 ↗(On Diff #66949)

Yes, for Instructions that use the same value more than once (see Michael's comment).

4135–4138 ↗(On Diff #66949)

No. We currently always create an insertelement on scalarization. I added support for the non-insertelement case for completeness, since predication is done separately from scalarization.
IIUC this case will need to be supported if & when Matthew's patch is committed, but for now it's really FFU.
I'll replace this case with an assertion for this patch and leave it to Matthew to resurrect in his patch as needed.

4138 ↗(On Diff #66949)

We are re-introducing the original scalar conditional execution of an instruction here. This undef can reach either

  • a select that will blend it out, or
  • a Use dominated by this instruction's BB that is either
    • predicated by (at least) the same predicate, which won't use the undef, or
    • not predicated due to no side effects, where undef would be safe
mssimpso added inline comments.Aug 10 2016, 5:06 AM
4135–4138 ↗(On Diff #66949)


For the non-insert cases that might arise with my patch, I think it would be better to leave the case here (don't add an assert), but include a test with this patch that would break with the other patch. Does that make sense? That will keep the patches better self-contained and prevent us from having to revisit this.

You will need a test where the div only feeds an instruction that will also be scalar (like a GEP). If this patch lands before the other one, it will be cleaner if in the other we just have to change the test. Presumably, that would involve replacing the PHI for the inserts with a PHI for the predicated instruction, since the inserts will no longer be there? If the other patch lands first, there will be no issue since you'll already have a test for the non-insert case.

Ayal added inline comments.Aug 10 2016, 6:33 AM
3400 ↗(On Diff #66949)

Suggest to first add the cost of the InsertElement and then the (optional) cost of the Phi, just to keep things in the same order they will be executed. Furthermore, cost if Extract should precede that of Insert.

4346–4347 ↗(On Diff #66949)

Suggest to rename Op2 to Divisor.

It may be worthwhile to generalize and check isKnownNonZero(). Non-zero divisors that are not compile-time constants will not be converted into multiplication, so we will still end up scalarizing the division, but can do so w/o predication.

Also, store predication is currently behind a flag that defaults to false (-enable-cond-stores-vec). Since you're reusing the store predication logic, I'm wondering if the mayDivideByZero cases should be under the flag as well. What do you think?


Also, store predication is currently behind a flag that defaults to false (-enable-cond-stores-vec). Since you're reusing the store predication logic, I'm wondering if the mayDivideByZero cases should be under the flag as well. What do you think?


The reason enable-cond-stores-vec defaults to false is the lack of cost modeling for predicated stores.
The change to getScalarizationOverhead() is supposed to solve this for divs. But I guess it depends on the real-world performance impact this patch has.

gilr updated this revision to Diff 67673.Aug 11 2016, 5:12 AM
gilr edited edge metadata.
  • Per Matt's comment: code continues to support the non-insertelement case; added an FFU test that should fail once the vectorizer supports direct scalar-scalar use.
  • Implemented (hopefully) all other review comments
  • Ran clang-format again
mssimpso added inline comments.Aug 11 2016, 2:26 PM
136–138 ↗(On Diff #67673)

Thanks for adding the additional "future" test. I don't think it will exercise the non-insert case, though. I'm very sorry for not being more clear previously. Here, %rsd will always have to be inserted into a vector since it will be directly used by a select instruction, which will remain vectorized. I didn't think of this when I last commented. But I think if you add an additional instruction, this should produce the desired effect. Something like:

  %tmp = sdiv i32 %psd, %lsd
  %rsd = sdiv i32 %tmp, %lsd
  br label %if.end

When I ran the modified test with this patch and the scalar patch, the non-insert case was used for %tmp and the insert case was used for %rsd. This makes sense becase %tmp is only used by %rsd (will be scalar), and %rsd will again feed the vector select.

gilr added inline comments.Aug 12 2016, 2:01 PM
136–138 ↗(On Diff #67673)

Argh, sorry about that. Your explanation was clear - just a hasty implementation on my side :(
Yes, the second sdiv should go under the same condition - will fix.

gilr updated this revision to Diff 67972.Aug 14 2016, 2:22 AM

Fixed FFU test case.

mssimpso added inline comments.Aug 15 2016, 11:39 AM
137–139 ↗(On Diff #67972)

The test looks good to me now. Thanks!

gilr added a comment.Aug 16 2016, 8:48 AM


Do you have any other comments? Am I good to go with this change?

Gil, I will look at this today.

anemet added inline comments.Aug 17 2016, 11:58 AM
4070–4096 ↗(On Diff #67972)

I would also include the select instruction in this excerpt (omitting anything in between with a ...). Then you can explain in the initial comment that the value produced on the false branch is not used (the conditional execution is only reintroduced to avoid side-effects).

4076 ↗(On Diff #67972)

"So for the first element of a scalarized instruction, e.g."

4111–4118 ↗(On Diff #67972)

OK, is this difference relevant? If yes, add a helper either in this module or at a more global place and use it.

You can probably also writes this with std::all_of or something.

4141 ↗(On Diff #67972)

We are re-introducing the original scalar conditional execution of an instruction here. This undef can reach either

Makes sense but we need a comment for this. I think the best is to explain this on the excerpt at the beginning, see my comment there.

8–9 ↗(On Diff #67972)

As a demo for how this works it would be actually good to include at least one of the second element sequences as well.

20–21 ↗(On Diff #67972)

Can you please name and match this extract as well, it helps reading. Everywhere in these tests.

22–23 ↗(On Diff #67972)

It would be also good to check the extractelements feeding the divs here.

gilr added inline comments.Aug 18 2016, 2:32 PM
4111–4118 ↗(On Diff #67972)

It shouldn't make a difference for the currently predicated instructions. I'll replace with hasOneUse with a comment to capture the possible (conservative) inaccuracy.

gilr updated this revision to Diff 68616.Aug 18 2016, 2:54 PM

Implemented Adam's comments.

anemet accepted this revision.Aug 23 2016, 2:21 PM
anemet added a reviewer: anemet.

LGTM too with the comments addressed below. Thanks!

4079 ↗(On Diff #68616)

Please remove the '; pred =' comments

You may want to add a

// ...

after the for.body: label in all these loops. That is where the div operand would be loaded, etc.

4091 ↗(On Diff #68616)

s/selected-out/if-converted using a select/

4101–4102 ↗(On Diff #68616)

%33 and %34 are not used, please remove

This revision was automatically updated to reflect the committed changes.