This is an archive of the discontinued LLVM Phabricator instance.

[LV] Scalarize operands of predicated instructions
ClosedPublic

Authored by mssimpso on Oct 28 2016, 9:00 AM.

Details

Summary

This patch attempts to scalarize the operand expressions of predicated instructions if they were conditionally executed in the original loop. After scalarization, the expressions will be sunk inside the blocks created for the predicated instructions. The transformation essentially performs un-if-conversion on the operands.

The cost model has been updated to determine if scalarization is profitable. It compares the cost of a vectorized instruction, assuming it will be if-converted, to the cost of the scalarized instruction, assuming that the instructions corresponding to each vector lane will be sunk inside a predicated block, possibly avoiding execution. If it's more profitable to scalarize the entire expression tree feeding the predicated instruction, the expression will be scalarized; otherwise, it will be vectorized. We only consider the cost of the entire expression to accurately estimate the cost of the required insertelement and extractelement instructions.

Diff Detail

Repository
rL LLVM

Event Timeline

mssimpso updated this revision to Diff 76204.Oct 28 2016, 9:00 AM
mssimpso retitled this revision from to [LV] Scalarize operands of predicated instructions.
mssimpso updated this object.
mssimpso added reviewers: anemet, mkuper, gilr.
mssimpso added subscribers: llvm-commits, mcrosier.
dorit added a subscriber: dorit.Oct 30 2016, 3:01 PM
gilr added inline comments.Oct 31 2016, 2:40 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6546 ↗(On Diff #76204)

It seems the only instructions pushed into Worklist are PredInst and instructions from its basic block. Is the invariance check necessary?

mkuper added inline comments.Oct 31 2016, 5:03 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
1916 ↗(On Diff #76204)

Can this actually happen, or should there be an assert here?

This seems weird, because if we sometimes call this before we collect the instructions for the VF, and sometimes after, then we'll get different results?

4678 ↗(On Diff #76204)

We have 5 places that call isScalarAfterVectorization().
Is this the only call site that cares about this? (If all of them should care, perhaps wrap in a helper function?)

mssimpso added inline comments.Nov 1 2016, 7:33 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
1916 ↗(On Diff #76204)

You're right, this can be an assert. I'll update the patch. Thanks!

4678 ↗(On Diff #76204)

You're asking if all places where we call isScalarAfterVectorization should also be concerned with isProfitableToScalarize? I think a helper makes sense in general, but this is the only case (at least currently) where I think it will make a difference.

In needsScalarInduction and widenIntInduction, the IVs shouldn't be predicated so it shouldn't make a difference. And in collectValuesToIgnore, we haven't yet computed the instruction costs.

6546 ↗(On Diff #76204)

Ah, that's right. Thanks! I think the invariance check is leftover from a previous implementation I had. I'll remove it.

mssimpso updated this revision to Diff 76631.Nov 1 2016, 2:29 PM
mssimpso marked 2 inline comments as done.

Addressed comments from Gil and Michael. Thanks!

  • Removed unnecessary invariance check from computePredInstDiscount
  • Added assert to isProfitableToScalarlize
  • Added call to collectInstsToScalarize for user-selected VFs (I realized we needed this with the new assert)
  • Made some auto types explicit
mkuper edited edge metadata.Nov 6 2016, 9:53 AM
  • Added call to collectInstsToScalarize for user-selected VFs (I realized we needed this with the new assert)

Yay, finally being pedantic and asking for an assert pays off! :-)

lib/Transforms/Vectorize/LoopVectorize.cpp
1962 ↗(On Diff #76631)

Maybe make more explicit that the cost is the cost of the instruction if scalarized? ("instruction-cost pairs for each choice of vectorization factor" seems to imply the vectorized cost.)

1968 ↗(On Diff #76631)

This can also be more explicit. "A negative returned value implies..."

6154 ↗(On Diff #76631)

Could you add a test for this?

6570 ↗(On Diff #76631)

"to to" -> "to"

6583 ↗(On Diff #76631)

Just trying to understand if I got this right.
Let's say we have:

%a = add i32, ...
%r = udiv i32 %x, %a
%s = udiv i32 %y, %a
%t = udiv i32 %z, %a

In the same block.
Will we do the right thing evaluating the cost of scalarizing the add in a separate context for each udiv?

6624 ↗(On Diff #76631)

Can we end up with a non-scalar type during the traversal?
I'm thinking about something like a div that depends on an extractvalue from a struct.

4678 ↗(On Diff #76204)

And in collectValuesToIgnore, we haven't yet computed the instruction costs.

So, we get some imprecision here, right? If we end up scalarizing things that aren't in VecValuesToIgnore then we'll overestimate the register pressure? Or am I confused?

Anyway, assuming I got it right - I'm not saying we need to fix this in this patch. The patch, as it is, is probably still a strict improvement. But a FIXME would be good.

gilr added inline comments.Nov 7 2016, 8:14 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6583 ↗(On Diff #76631)

Actually in such a case it seems the addition gets scalarized but cannot be sunk down because of the multiple uses.
I also wonder if we calculate the cost correctly when one predicated instruction feeds another.

6605 ↗(On Diff #76631)

The following loop causes the patch to assert

void foo(int* restrict a, int b, int* restrict c) {

for (int i = 0; i < 10000; ++i) {
  if (a[i] > 777) {
    int t2 = c[i];
    int t3 = t2 / b;
    a[i] += t3;
  }
}

}

while trying to scalarize a uniform value (the predicated load).
Got this example to work by skipping I if any of its operands isUniformAfterVectorization().

6624 ↗(On Diff #76631)

Wasn't the cost of insert-element accounted for by VectorCost?

4678 ↗(On Diff #76204)

I think Michael's comment raises a more general question: should it matter why we scalarize?
Put differently, once an instruction is marked for scalarization by legal or cost, shouldn't we treat it uniformly in the code?
[if so, then the question "will the instruction be scalarized" would only be well-defined in the context of some VF]

Michael/Gil,

Thanks for all the comments! I'll update a new version of the patch shortly.

lib/Transforms/Vectorize/LoopVectorize.cpp
1962 ↗(On Diff #76631)

Sounds good.

1968 ↗(On Diff #76631)

Sounds good.

6154 ↗(On Diff #76631)

Sure.

6570 ↗(On Diff #76631)

Thanks!

6583 ↗(On Diff #76631)

Yeah, Gil is right here in that the add in this example will not be sunk. Thanks, Gil! This is because our current predication method places each udiv in it's own separate block. Long term I think we will want to keep the predicated instructions in the same block after vectorization if they were in the same block before vectorization. I can add some comments about this in the meantime.

So we will need to check for the multi-context case, then. But this doesn't mean that we can't sink instructions with multi-uses. We should still be able to handle the following, for example.

if:
  %a = add i32, ...
  %r = add i32 %a, %a
  %s = udiv i32 %x, %r
  br

I'll add some tests.

6605 ↗(On Diff #76631)

Thanks, Gil!. I'll fix this in the updated patch and add an appropriate test.

6624 ↗(On Diff #76631)

For the cost of the insertelement for predicated instructions: that's right. We computed it in VectorCost. The insert overhead should be the same for both versions of the code: predication as-is vs. predication with scalarized operands. So we either need to subtract this overhead from VectorCost or add it to ScalarCost to make the calculations comparable.

For the non-scalar type question: this may be possible (I'll need to investigate). I'll add a test either way. Thanks!

4678 ↗(On Diff #76204)

That's right. There is some conservatism in collectValuesToIgnore. Currently, we place a value in VecValuesToIgnore only if we're sure (for all VFs) that the value will be scalar. But this doesn't mean that all other values will be vectorized. With this patch, we'll have a more precise determination of the known scalars after VF selection.

For the question about scalarization by legality vs. the cost model, for code generation, it's not going to matter why we decide to scalarize. In legality, we know that for any (legal) VF that we select, a value will be scalarized. In the cost model, we decide for particular VFs if it's more profitable to scalarize. When we come to code generation, we know what that VF is.

If the comment is that InnerLoopVectorizer shouldn't need to make a distinction between Legal and Cost scalarization, then we can create a helper function like Michael suggested for InnerLoopVectorizer to use. What do you think?

mkuper added inline comments.Nov 8 2016, 11:10 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
4678 ↗(On Diff #76204)

That sounds good to me.
I think eventually we'll probably need to keep a full "this will get scalarized" list per-VF instead of the current VecValuesToIgnore, but that's an issue for another patch.

(By the way, do we really expect to see cases where scalarization is better for some VFs, but worse for others?)

mssimpso added inline comments.Nov 8 2016, 11:40 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
4678 ↗(On Diff #76204)

At least for this patch, scalarization tends to become less profitable as the VF increases. This is primarily because, without PGO, we always assume a 0.5 predicated block probability.

mkuper added inline comments.Nov 8 2016, 1:34 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
4678 ↗(On Diff #76204)

Ah, ok, makes sense.

mssimpso added inline comments.Nov 11 2016, 10:38 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6583 ↗(On Diff #76631)

Michael/Gil,

After thinking about the multi-context case a bit more (and not wanting to revisit this again later), I decided to take a stab at modifying code generation for the predicated instructions so that we don't place them in separate basic blocks if we don't have to. I'll upload this as a separate patch shortly.

Matt.

Michael/Gil,

I'm wrapping up addressing your comments, but I have a question about testing. Since this change is cost-model driven, all tests should really live under a target-specific directory (i.e., test/Transforms/LoopVectorize/AArch64). That's easy enough to do, but what should we do about all the target-independent tests that currently exist in the top-level directory? With this change, we would be introducing target-specific behavior that may change how those tests are vectorized, depending on the default target.

I can think of two ways to avoid this: (1) disable scalarization when a user-specified vectorization factor is provided. That is, only perform the optimization when the cost model is actually run. Or (2) introduce a new flag (e.g., --enable-scalarization-with-predication), defaulting to true that we could then disable in the run line of all the current top-level tests.

What do you think?

I'm not a fan of (1) - I don't think specifying the vectorization factor manually should mean that you're uninterested in all other cost-model considerations.

Regarding (2), I'm not sure what's worse - knob proliferation, or not having target-independent tests. That is, one knob is fine, but I wouldn't want to add a knob for every cost-model-dependent decision.
Maybe we can have a catch-all knob that basically says "use the default TTI instead of the current target's"? Does that even make sense?

I'm not a fan of (1) - I don't think specifying the vectorization factor manually should mean that you're uninterested in all other cost-model considerations.

Regarding (2), I'm not sure what's worse - knob proliferation, or not having target-independent tests. That is, one knob is fine, but I wouldn't want to add a knob for every cost-model-dependent decision.
Maybe we can have a catch-all knob that basically says "use the default TTI instead of the current target's"? Does that even make sense?

That should be doable. I'll give it a shot in a separate patch.

mssimpso updated this revision to Diff 79755.Nov 30 2016, 9:00 AM
mssimpso edited edge metadata.
mssimpso marked 9 inline comments as done.

Addressed comments from Michael and Gil.

I think I've taken care of everything mentioned until now. Thanks again for the reviews! With my testing concerns addressed, I think we're fine with the code generation tests remaining target-independent.

  • Added a helper function in InnerLoopVectorizer combining Legal->isScalarAfterVectorization with Cost->isProfitableToScalarize. I replaced all uses of isScalarAfterVectorization in the vectorizer with the new helper function. I added a FIXME for the use of isScalarAfterVectorization in collectValuesToIgnore, like Michael suggested.
  • Added an assert for Michael's non-scalar type question. Non-scalar types aren't allowable (we already check for this in canVectorizeInstrs), but I don't think an additional assert will hurt.
  • Added a test requested by Michael for user-specified vectorization factors with no interleaving (AArch64/aarch64-predication.ll). The test is target-specific to ensure we perform the same optimization with and without manually specifying the vectorization factor.
  • Reorganized scalarization conditions in canBeScalarized, and added a check for operands that are uniform-after-vectorization. This fixes the crash Gil discovered (X86/x86-predication.ll). The test is target-specific because the crash depends on the generation of masked loads.
  • Added a check in canBeScalarized for the multi-context case. We now only consider instructions forming a single-use chain from the original predicated block that would otherwise be vectorized. I added a test case for the multi-context costs (AArch64/predication_costs.ll).
  • Updated comments.

This LGTM, except some vague thoughts about slightly relaxing canBeScalarized().
If you think that's not a real concern, I'm ok with this going on as is.

lib/Transforms/Vectorize/LoopVectorize.cpp
6627 ↗(On Diff #79755)

Please add an explanation for why we bail on Legal->isScalarAfterVectorization(I)

6641 ↗(On Diff #79755)

Are you sure this is the right check?

If I understand correctly, we fail not because the pointer is uniform, but because it's only uniform in the "we use a single (LLVM) value to represent it" sense, not the "the (abstract) value is the same for all lanes" sense. Otherwise it'd be safe to scalarize.

I'm having a hard time to think of a good example, though because this is limited to instructions (so GV and param operands won't be affected), and uniform instructions tend to either be consecutive, loop-invariant (in which case they should be hoisted out by LICM before we hit the vectorizer), or uses of the scalar IV (in which case they won't be operands of vectorized instructions).

mssimpso added inline comments.Dec 1 2016, 9:25 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6627 ↗(On Diff #79755)

Sure. I don't think it's strictly necessary to bail here, but I couldn't think of an example where continuing to traverse the chain would be useful.

6641 ↗(On Diff #79755)

That's exactly right. We can't scalarize because we use a single value to represent the uniform-after-vec instructions. It's the difference between uniform meaning "only lane zero will be used so the others aren't needed" vs. "all lanes can be used but they will all be equal". I think we could change this if we wanted to. In getScalarValue, we currently assert if we're trying to access a uniform-after-vec instruction for a Lane > 0. We could instead clone the Lane == 0 value and return that. What do you think?

mkuper accepted this revision.Dec 1 2016, 3:25 PM
mkuper edited edge metadata.
mkuper added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
6641 ↗(On Diff #79755)

I'm really ok with this going in as is, as long as it's clearly documented.
We can fix this if ever run into a case where it matters, and I don't want to introduce even more complexity here if it's unwarranted.

This revision is now accepted and ready to land.Dec 1 2016, 3:25 PM
mssimpso added inline comments.Dec 2 2016, 4:49 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6641 ↗(On Diff #79755)

Sounds good, thanks Michael. I'll add some comments to make this very clear.

gilr added inline comments.Dec 2 2016, 7:56 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6688 ↗(On Diff #79755)

Is there scalarization overhead if canBeScalarized(J) is false due to isScalarAfterVectorization(J)?
More generally, shouldn't this line be under a !isScalarAfterVectorization(J) condition? We may have bailed out e.g. due to J being on a different basic block but J itself may be scalar (meaning we've "smoothed" a scalar-vector-scalar sequence). Is that correct?

mssimpso added inline comments.Dec 2 2016, 8:35 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6688 ↗(On Diff #79755)

That's right Gil. We also should check that J is actually contained in the loop before adding in the extract overhead. I had already added this to my working copy after uploading the last revision. But I'll upload a new revision with this change and Michael's comment suggestions for clarity. Thanks!

mssimpso updated this revision to Diff 80099.Dec 2 2016, 10:54 AM
mssimpso edited edge metadata.
mssimpso marked 3 inline comments as done.

Addressed comments from Michael and Gil.

  • Added documentation requested by Michael.
  • Added a needsExtract condition for determining if we need to compute a scalarization overhead, as mentioned by Gil. I updated some costs in the AArch64/predication_costs.ll test to reflect this. For predicated stores, we know their GEP pointer operands will be scalar, so we no longer will calculate an extract cost for them.

(Still LGTM :-) )

Did I address all of your comments, Gil? Thanks!

gilr accepted this revision.Dec 6 2016, 11:54 AM
gilr edited edge metadata.

Yes you did, Matt - sorry for the delay. LGTM too :)

In D26083#614898, @gilr wrote:

Yes you did, Matt - sorry for the delay. LGTM too :)

No problem - just wanted to make sure. Thanks for the review!

This revision was automatically updated to reflect the committed changes.