
[LV] Fix code gen for conditionally executed uniform loads
ClosedPublic

Authored by anna on Aug 27 2018, 9:19 AM.

Details

Summary

While working on D50665, I came across a latent bug in the vectorizer which generates
incorrect code for uniform memory accesses that are executed conditionally.

This affects architectures that have masked gather/scatter support;
see the added X86 test case. Without this patch, we were unconditionally
executing the load in the vectorized version, which can introduce a SEGFAULT
that never occurs in the scalar version.

The fix here is to avoid scalarizing uniform loads that are executed
conditionally. On architectures with masked gather support, these loads
should use the masked gather instruction.

Diff Detail

Repository
rL LLVM

Event Timeline

anna created this revision.Aug 27 2018, 9:19 AM
delena removed a reviewer: delena.Aug 27 2018, 11:50 AM

This affects architectures that have masked gather/scatter support.

I moved that check from legal to cost model (D43208). As a result, the same issue should happen for arch w/o masked gather/scatter support, probably less often. If you can make the LIT test generic (i.e., not x86 specific), that would be better (but not a must-fix).

We should introduce something like CM_PredicatedUniform and then generate code as "if (mask is not all zero) { load and bcast }" (and that should make the FileCheck validation much simpler), but as a stop-the-bleeding first approach this is fine.

Nice small fix. LGTM.

hsaito accepted this revision.Aug 27 2018, 2:23 PM

LGTM

This revision is now accepted and ready to land.Aug 27 2018, 2:23 PM
Ayal added inline comments.Aug 27 2018, 4:41 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5883 ↗(On Diff #162692)

Indeed, scalarizing a conditional load from a uniform address is currently broken when it can turn into a gather. The culprit is that VPReplicateRecipe, when constructed for such instructions, uses an incorrect IsPredicated = isScalarWithPredication(), which, as noted, is false for loads and stores of types that can be gathered/scattered; i.e., it really means mustBeScalarWithPredication. (That, on top of IsUniform, which is also incorrect, causing replication as noted in D50665#1209958.)

This proposed fix, to refrain from scalarizing such loads, should make sure they always turn into a gather, rather than relying on the cost-based gather-vs-scalarize decision below.

The above holds also for conditional loads from non-uniform addresses, that can turn into gathers, but possibly also get incorrectly scalarized w/o branches. It's hard to test, as the scalarization cost needs to be better than the gather for this to occur. But best make sure these also always turn into a gather.

It would also/alternatively be good to fix the scalarization of such loads (by correcting IsPredicated), and enable the cost-based gather-vs-scalarize decision. Presumably gather should always win, unless perhaps the load feeds scalarized users.

This affects architectures that have masked gather/scatter support.

I moved that check from legal to cost model (D43208). As a result, the same issue should happen for arch w/o masked gather/scatter support, probably less often. If you can make the LIT test generic (i.e., not x86 specific), that would be better (but not a must-fix).

Taking this back. The trigger requires "isLegalMask*()" returning true.

Ayal added a comment.Aug 28 2018, 6:42 AM

The above holds also for conditional loads from non-uniform addresses, that can turn into gathers, but possibly also get incorrectly scalarized w/o branches.

Ahh, the above holds also for conditional stores from non-consecutive/interleaved addresses, that can turn into scatters, but possibly also get incorrectly scalarized w/o branches. Good catch indeed!

anna added inline comments.Aug 28 2018, 7:02 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5877 ↗(On Diff #162692)

Also, this seems inaccurate - which is how I found the bug in the uniform store handling:
NumPredStores really counts both scatters and (branch + store) predicated stores. A better name would be NumScalarPredStores.

5883 ↗(On Diff #162692)

It seems to me that we need a new function LoopVectorizationCostModel::isPredicated which covers a superset of the instructions matched by LoopVectorizationCostModel::isScalarWithPredication.
So, the 'complete' fix here is to correct the cost model as done above *and* fix IsPredicated in VPRecipeBuilder::handleReplication:

bool IsPredicated = CM.isPredicated(I);
auto *Recipe = new VPReplicateRecipe(I, IsUniform, IsPredicated);

Essentially, whether we use a gather/scatter or do the plain old branch + load/store (because we don't have gather/scatter support), both are predicated - one is a 'vector predicate', the other is scalarized + predicated.

Ayal added inline comments.Aug 28 2018, 3:07 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5883 ↗(On Diff #162692)

As for a 'complete' fix...
The ten calls to isScalarWithPredication() don't all ask the same thing; it mostly depends on whether setWideningDecision() has already taken place, i.e., the decision whether a load/store will be CM_Scalarize (with or w/o predication) or CM_GatherScatter. Those called after setWideningDecision() depend on the widening decision set, which depends on the VF, essentially asking willBeScalarWithPredication(), i.e., excluding CM_GatherScatter. They are the majority, so best have them continue to use isScalarWithPredication(), passing it the VF. Three are called before, essentially asking distinct things - here's a breakdown:

  1. memInstCanBeWidened(): calls before the decision, in order to make it. Already knows it's a Load/Store. Needs to check if blockNeedsPredication && isMaskRequired && !isLegalMaskedLoad/Store. Note that isLegalMaskedGather/Scatter are irrelevant as they are not considered "Widened".
  2. useEmulatedMaskMemRefHack(): calls once before the decision and again after it. But only to assert(isScalarWithPredication()), so need to carefully assert only what's valid in both contexts.
  3. getMemInstScalarizationCost: checks what the cost would be if the Load/Store would be scalarized. Needs to check if blockNeedsPredication && isMaskRequired. Note that both isLegalMaskedLoad/Store and isLegalMaskedGather/Scatter are irrelevant.

Of the calls that take place after the decision, note that:

  1. setCostBasedWideningDecision: bumping NumPredStores++ should best be postponed to after the (CM_Scalarize) decision is reached, at the bottom.
  2. handleReplication: should call isScalarWithPredication(I, VF) from within getDecisionAndClampRange(), as done with IsUniform.
  3. tryToWiden: should call !isScalarWithPredication(I, VF) from within getDecisionAndClampRange().

Devising tests would be challenging, e.g., making gather/scatter decisions for some VF's yet scalarization for others.

anna added inline comments.Aug 29 2018, 8:22 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5883 ↗(On Diff #162692)

Thanks for the info Ayal, but unfortunately I don't have the bandwidth currently for a complete fix of all the problems described here.

My small fix here is that uniform accesses are handled correctly for the load case irrespective of the VF and VPlan (it will then be extended for the store case as the fix for D50665), but as you pointed out, it only adjusts the cost model. The actual vectorization should also be fixed so that a gather is guaranteed to remain a gather.

From your explanation and reading the code, my understanding is that the right fix for the uniform accesses is:

  1. what I've currently got here in setCostBasedWideningDecision to avoid CM_Scalarize for predicated blocks
  2. Fix memoryInstructionCanBeWidened use of isScalarWithPredication
  3. Fix for getMemInstScalarizationCost use of isScalarWithPredication so that we will correctly choose the decision.
  4. Implement isScalarWithPredication(I, VF) where we retrieve the widening decision for the VF and make sure it's NOT CM_GatherScatter.
  5. handleReplication calls isScalarWithPredication(I, VF)
  6. tryToWiden: should call !isScalarWithPredication(I, VF)
anna updated this revision to Diff 163148.Aug 29 2018, 11:23 AM

addressed review comments - we make sure that vectorization also uses the cost decision of gather/scatter
instead of scalarizing.
This also handles the original bug of generating incorrect code for conditional uniform loads.

anna added a comment.EditedAug 29 2018, 11:48 AM

To state what's fixed in latest patch:

  1. Uniform conditional loads on architectures with gather support will have correct code generated. In particular, the cost model (setCostBasedWideningDecision) is fixed, and the codegen for the replication and widening recipes does the respective operations ONLY for widening decision != CM_GatherScatter. For the recipes which are handled after the widening decision is set, we use the isScalarWithPredication(I, VF) form which is added in the patch.
  2. Fix the vectorization cost model for scalarization (getMemInstScalarizationCost): implement and use isPredicatedInst to identify *all* predicated instructions, not just scalar+predicated ones. So the cost for scalarization is now increased for masked loads/stores and gather/scatter operations. In short, we should be choosing gather/scatter in place of scalarization on archs where it is profitable. On archs where we don't have support for gather/scatter, we may no longer vectorize, because scalarized and predicated loads/stores have high cost.
  3. Given the above two, the assert in useEmulatedMaskMemRefHack needed to be weakened.
anna requested review of this revision.Aug 30 2018, 7:57 AM
hsaito added inline comments.Sep 4 2018, 6:48 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
1433 ↗(On Diff #163148)

Maybe nit-picking, but I think the logic is clearer if we write this code as:

if (load_or_store) {
  return Legal->isMaskRequired(I);
}
return isScalarWithPredication(I);

4404 ↗(On Diff #163148)

If "WideningDecision == CM_Scalarize" doesn't work here, we need comments on why it won't work.
Else, change the condition to the more straightforward "WideningDecision == CM_Scalarize" to avoid bugs now or later.

6826 ↗(On Diff #163148)

Same as Line 4404.

anna marked 2 inline comments as done.Sep 5 2018, 12:23 PM
anna added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
6826 ↗(On Diff #163148)

I believe it does not apply here. We specifically *need* to check for widening decisions. At this point, the decision can be CM_GatherScatter or a widening decision (had we not called tryToWidenMemory, we would fail this assert). We will never get CM_Scalarize here because of the check at line 6747.

So, I think this assert is pretty clear in that aspect.

anna updated this revision to Diff 164101.Sep 5 2018, 12:26 PM

addressed review comments.

hsaito added inline comments.Sep 5 2018, 3:35 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6826 ↗(On Diff #163148)

In non-NativeVPlan mode, tryToWidenMemory() is the only place that creates VPWidenMemoryRecipe(), which handles Widen, Widen_Reverse, and GatherScatter. If we intend to emit vector loads/stores or gathers/scatters, that needs to be caught there.

Line 6747 filters out only the predicated scalars. So, unmasked scalar loads/stores would hit here.

So, in my opinion, if CM_GatherScatter hits here, that means "cost model wants to use gather/scatter but we'll serialize". That is an error condition, isn't it?

5754 ↗(On Diff #164101)

Just a note. No actionable items here.

Where the code modification matters, the code before the change ends up comparing masked gather/scatter (we shouldn't see masked widened load/store in the called context) against unmasked scalarized load/store. So this change makes sense. As a result, the modified assert condition in useEmulatedMaskMemRefHack() makes sense.

Just be aware that we may end up losing some vectorization as a result, but those are likely to be the places where we are generating wrong code today: masked gather/scatter is available but cost model decides to scalarize. This scalarized code isn't properly masked --- because masking there is gated by masked gather/scatter being unavailable.

anna added inline comments.Sep 6 2018, 10:49 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6826 ↗(On Diff #163148)

Line 6747 filters out only the predicated scalars. So, unmasked scalar loads/stores would hit here.

agreed. that's right. So we can hit CM_Scalarize here.

So, in my opinion, if CM_GatherScatter hits here, that means "cost model wants to use gather/scatter but we'll serialize". That is an error condition, isn't it?

Actually, that doesn't agree with the error message though: "Memory widening decisions should have been taken care by now".

Also, even if we hit CM_GatherScatter here, we will not widen because willWiden returns false.

Basically, this function tryToWiden does not handle loads and stores of any kind: by this point, we should have finished handling CM_Widen, CM_Widen_Reverse, CM_GatherScatter and CM_Interleave.

So, I think you mean that we should leave the assert as-is because only non-predicated loads and stores of CM_Scalarize are allowed to reach here?
Thanks for clearing this.

Also, the error message is really unclear in that aspect. It's not just that widening isn't allowed here - none of the other widening decisions are allowed either. It should state something like "only unmasked loads and stores allowed here".

anna updated this revision to Diff 164264.Sep 6 2018, 12:06 PM

addressed review comment.

hsaito accepted this revision.Sep 6 2018, 12:23 PM

LGTM

lib/Transforms/Vectorize/LoopVectorize.cpp
6826 ↗(On Diff #163148)

It should state something like "only unmasked loads and stores allowed here"

I agree. That would be clearer.

This revision is now accepted and ready to land.Sep 6 2018, 12:23 PM
This revision was automatically updated to reflect the committed changes.
Ayal added a comment.Sep 7 2018, 9:31 AM

(post commit review)

lib/Transforms/Vectorize/LoopVectorize.cpp
5883 ↗(On Diff #162692)

Thanks for the info Ayal, but unfortunately I don't have the bandwidth currently for a complete fix of all the problems described here.

OK, sure, understood (and thanks BTW for helping take care of PR38786 :-). This was in response to your "So, the 'complete' fix here is to ..."

OTOH, looks like this patch does take care of nearly everything listed above ... except for (1) memInstCanBeWidened() and (1) bumping NumPredStores after we've made the decision - which raises a subtle dependence issue as useEmulatedMaskMemRefHack() uses NumPredStores. And making sure VF is passed to isScalarWithPredication() wherever it should.

So out of the ten calls to isScalarWithPredication(), two are redirected here to isPredicated() - the assert in useEmulatedMaskMemRefHack() and the call in getMemInstScalarizationCost(), effectively taking care of (2) and (3) above. Note that both evidently deal with Mem loads/stores. As an alternative, collectInstsToScalarize() could first check if I is a load/store before it calls useEmulatedMaskMemRefHack(&I). This way, isPredicate(I) will always be used for loads/stores, and could be folded into an if (blockNeedsPredication && isMaskRequired), as noted above for (3).

This proposed fix to refrain from scalarizing such loads, should make sure they always turn into a gather, rather than rely on the cost-based gather-vs-scalarize decision below.

So the fix here allows conditional loads from uniform addresses to eventually turn into gathers or be scalarized, according to cost-model decision, right? No enforcement to scalarize appears below.

Wonder if the above addition of !BlockNeedsPredication is still needed(?)

1424 ↗(On Diff #164264)

"non-zero" >> "non one": VF is always non-zero. VF=1 usually stands for the original scalar instructions (e.g., possibly unrolled by UF). Here, iiuc, VF>1 corresponds to asking willBeScalarWithPredication() - after per-VF decisions have been taken, whereas VF=1 corresponds to asking mustBeScalarWithPredication() - before per-VF decisions have been taken; as called by the assert of useEmulatedMaskMemRefHack() in one context. But tryToWiden() and handleReplication() may also pass VF=1, and they are called after decisions have been made.

Best use isScalarWithPredication() always after per-VF decisions have been made. In any case, better have each caller specify the VF explicitly, to make sure it's passed whenever available; e.g., see below.

"Such instructions include" also conditional loads, with this patch.

1428 ↗(On Diff #164264)

Note that this interface is VF agnostic; some VF's may decide to scalarize/replicate where other VF's gather/scatter, but they all predicate, or none do.

As mentioned below, this could be simplified to mean isMaskedMemInst(), w/o needing to call isScalarWithPredication(VF=1).

4407 ↗(On Diff #164264)

When called after having made per-VF decisions, having VF=1 simply says it's scalar, rather than checking mustBeScalar here.

5519 ↗(On Diff #164264)

isScalarWithPredication(&I, VF)

anna added inline comments.Sep 10 2018, 7:07 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5883 ↗(On Diff #162692)

So the fix here allows conditional loads from uniform addresses to eventually turn into gathers or be scalarized, according to cost-model decision, right?

Yes, it is according to the cost model decision that is made later on in this function itself.

Wonder if the above addition of !BlockNeedsPredication is still needed(?)

I think we still need it. Otherwise, we're not giving the cost model a chance to choose between scalarize versus gather/scatter, and gather/scatter may be more profitable on AVX-512, for example. (See the continue at line 5909.)

1424 ↗(On Diff #164264)

"non-zero" >> "non one": VF is always non-zero. VF=1 usually stands for the original scalar instructions (e.g., possibly unrolled by UF).

oops. typo, will fix the comment.

1428 ↗(On Diff #164264)

As mentioned below, this could be simplified to mean isMaskedMemInst(), w/o needing to call isScalarWithPredication(VF=1).

Actually what hsaito mentioned below is an NFC - just changing the order of load/store versus other instructions.

Secondly, I think we cannot do what you mentioned - that's what I started with originally, and it failed test cases. Specifically, isMaskedMemInst() looks for exactly that - memory instructions. See the MaskedOp insertion in LoopVectorizationLegality::blockCanBePredicated.
However, other instructions like divide etc. are also predicated - see isScalarWithPredication for non-memory instructions.

4407 ↗(On Diff #164264)

This is the form of isScalarWithPredication that is called *before* making any widening decisions, i.e. all the calls without passing in the VF. Do we need to make the distinction between "no decision made yet" versus "decision made to scalarize instead of vectorizing"?

5519 ↗(On Diff #164264)

will fix.

Ayal added inline comments.Sep 10 2018, 4:54 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5883 ↗(On Diff #162692)

OK, so we (continue to) assume that a scalar load + broadcast is better than a gather for the unconditional case, but fall back to a general cost-model-based scalarize-versus-gather decision for the conditional uniform load case - at least until a scalar load + broadcast under a single 'if mask not empty' condition is introduced. The
// Conditional loads should be scalarized and predicated.
comment should be updated.

1428 ↗(On Diff #164264)

Note that this interface is VF agnostic; some VF's may decide to scalarize/replicate where other VF's gather/scatter, but they all predicate, or none do.

As mentioned below, this could be simplified to mean isMaskedMemInst(), w/o needing to call isScalarWithPredication(VF=1).

The "interface" referred to above was isPredicatedInst(), which was called only from useEmulatedMaskMemRefHack() and getMemInstScalarizationCost(), and could be simplified to deal with loads/stores only; unlike isScalarWithPredication().

But looks like isPredicatedInst() was dropped, and replaced by calls to isScalarWithPredication() with default VF=1?

4407 ↗(On Diff #164264)

Yes, this was the distinction I suggested we should make: to devote isScalarWithPredication() with a mandatory VF parameter to mean "willBeScalarWithPredication()" for that VF, i.e., after decisions have been made; this is also in accordance with its original documentation:

/// Returns true if \p I is an instruction that will be scalarized with
/// predication.

, and have the 3 calls that take place before decisions have been made, each check explicitly what it needs:

  1. memoryInstructionCanBeWidened(): instead of
// If the instruction is a store located in a predicated block, it will be
// scalarized.
if (isScalarWithPredication(I))
  return false;

have something like

// If the instruction is a load/store located in a predicated block, requires
// a mask, and masked loads/stores are unavailable, the instruction will not
// be widened. Note that isLegalMaskedGather/Scatter are irrelevant as they
// are not considered "Widened" here.
if (blockNeedsPredication && isMaskRequired && !isLegalMaskedLoad/Store)
  return false;
  2. useEmulatedMaskMemRefHack(): instead of
assert(isPredicatedInst(I) && "Expecting a scalar emulated instruction");

have something like

if (not load nor store)
  return false;
assert(blockNeedsPredication && isMaskRequired && "Expecting a scalar emulated instruction");

similar to isPredicatedInst().

  3. getMemInstScalarizationCost(): can also use "if (blockNeedsPredication && isMaskRequired)" as above. Note that conditional loads from safe addresses, i.e. when isMaskRequired is false, may still benefit from sinking under their condition when scalarized; but this is taken care of by computePredInstDiscount() / sinkScalarOperands().
  4. Bump NumPredStores++ to after deciding to scalarize (and predicate), retaining the behavior of getMemInstScalarizationCost(), which uses it.