
[LV] Drop integer poison-generating flags from instructions that need predication
Closed, Public

Authored by dcaballe on Oct 14 2021, 3:12 PM.

Details

Summary

This patch fixes PR52111. The problem is that LV propagates poison-generating flags (nuw/nsw, exact,
and inbounds) on instructions that contribute to the address computation of widened loads/stores that are
guarded by a condition. When the code is vectorized and the control flow within the loop is linearized,
these flags may lead to a poison value that is effectively used as the base address
of the widened load/store. For example, in the following loop:

loop2.header:
  %iv2 = phi i64 [ 0, %loop1.header ], [ %iv2.inc, %if.end ]
  %i23 = icmp eq i64 %iv2, 0
  br i1 %i23, label %if.end, label %if.then

if.then:
  %i27 = sub nuw nsw i64 %iv2, 1
  %i29 = getelementptr inbounds [1 x [33 x float]], [1 x [33 x float]]* %input, i64 0, i64 %iv1, i64 %i27
...

'%i27' is a uniform operation related to address computation, and it is guarded by condition
'%i23' (%iv2 > 0). Its nuw flag is only valid because of this condition: the condition ensures
that the value of '%iv2' reaching '%i27' is always greater than zero and, consequently, that
the result of '%i27' never wraps. However, after vectorizing and linearizing the
control flow, '%i27' (uniform, address computation) is also executed for '%iv2 = 0', and
the nuw flag becomes invalid.

The fix drops all the integer poison-generating flags from instructions that contribute to the
address computation of a widened load/store whose original (scalar) instruction was in a basic block
that needed predication but is not predicated after vectorization.
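
To make the failure mode concrete, below is a hand-written, simplified sketch of a linearized vector body for a loop of this shape. It is not the vectorizer's actual output; VF=4, the trip count, and all value names are invented. The point is that the sub, no longer guarded, also runs for the lane where %iv2 == 0, and its first lane feeds the base address of the masked load, so keeping nuw/nsw on it (and inbounds on the GEP) would make that base address poison:

define void @sketch(float* %input, float* %output) {
entry:
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %vec.iv = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %entry ], [ %vec.iv.next, %vector.body ]
  ; Mask of the original if.then block: lanes where the scalar %iv2 != 0.
  %mask = icmp ne <4 x i64> %vec.iv, zeroinitializer
  ; No longer guarded: lane 0 of the first iteration computes 0 - 1, so the
  ; nuw/nsw of the scalar sub must be dropped or this lane would be poison.
  %sub = sub <4 x i64> %vec.iv, <i64 1, i64 1, i64 1, i64 1>
  ; Lane 0 of %sub becomes the base address of the masked load, so it is
  ; used even though lane 0 of the loaded data is masked out.
  %lane0 = extractelement <4 x i64> %sub, i32 0
  %gep = getelementptr float, float* %input, i64 %lane0
  %vec.ptr = bitcast float* %gep to <4 x float>*
  %wide.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %vec.ptr, i32 4, <4 x i1> %mask, <4 x float> undef)
  ; Masked-out lanes get 0.0, mirroring the phi of the scalar loop.
  %sel = select <4 x i1> %mask, <4 x float> %wide.load, <4 x float> zeroinitializer
  %out.gep = getelementptr float, float* %output, i64 %index
  %out.ptr = bitcast float* %out.gep to <4 x float>*
  store <4 x float> %sel, <4 x float>* %out.ptr, align 4
  %index.next = add i64 %index, 4
  %vec.iv.next = add <4 x i64> %vec.iv, <i64 4, i64 4, i64 4, i64 4>
  %done = icmp eq i64 %index.next, 32
  br i1 %done, label %exit, label %vector.body

exit:
  ret void
}

declare <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>*, i32 immarg, <4 x i1>, <4 x float>)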


Event Timeline


Fixed RISCV test

spatel accepted this revision. Oct 22 2021, 6:57 AM

LGTM - the use of the standard dropPoisonGeneratingFlags is good, and the code comment about inbounds seems right, but I'm not familiar enough with LV (and these tests specifically) to know if there's more/less we should/could do, so please wait for at least one more reviewer to sign off.

This revision is now accepted and ready to land. Oct 22 2021, 6:57 AM

Thanks, Sanjay! Anything else, Florian/Roger/Roman?

lebedev.ri added a subscriber: nlopes.

I think this is right, but just to be extra sure, let's ask @nlopes ?

nlopes added a comment (edited). Oct 27 2021, 9:24 AM

I don't know the loop vectorizer code, but I have two concerns.

  1. Correctness:

The underlying issue in the bug report is that we widen a load into a masked vector load that takes only the address of the first iteration. We need to ensure this address is not poison if it wouldn't be poison for the non-masked-out loads.
Dropping poison-producing flags from any operation hoisted out of conditional BBs that contributes to the address is a good step forward. But it doesn't seem enough.
Consider this example:

loop:
  %i = phi [ poison, %entry ], [ %i3, %loop ]
  %i2 = phi [ 0, %entry ], [ %i3, %loop ]
  if (%i2 > 0) {
    %p = gep %p, %i
    load %p
  }
  %i3 = add %i2, 1
  br %cond, %loop, ...

Now vectorize that load into a masked load of %p and we are doomed, because %p is poison in the first iteration.
I'm not sure this example would kick in with the vectorizer, as it needs to prove that the loads are contiguous, but maybe SCEV will take the poison as 0 and vectorization kicks in. Anyway, the point is that just dropping poison-producing attributes may not be enough.

The second point is that it seems the patch drops attributes from all hoisted instructions, but that's not strictly needed. You only need to drop attributes from instructions that contribute to instructions that are widened and that produce UB if given poison. I don't think LLVM can produce e.g. a division of a vector by a scalar (if the loop always divides by the same value).

Thanks for the feedback, Nuno!

Dropping poison-producing flags from any operation hoisted out of conditional BBs that contributes to the address is a good step forward. But it doesn't seem enough.

That's a great point. One option could be avoiding masking in basic blocks with direct or indirect uses of poison and scalarize that code instead. We could even only scalarize the impacted instructions and apply masking to the rest. I think addressing those cases would require much more involved changes. I agree that the current fix is a good step forward. We can address more complex cases incrementally.

The second point is that it seems the patch drops attributes from all hoisted instructions, but that's not strictly needed. You only need to drop attributes from instructions that contribute to instructions that are widened and that produce UB if given poison. I don't think LLVM can produce e.g. a division of a vector by a scalar (if the loop always divides by the same value).

LV can produce a division of a vector by a scalar by broadcasting the scalar into a vector. Is that what you mean?

I think the key points here are the instructions with the attributes and their guarding predicate. These instructions, *regardless of their uses*, may produce a poison value and UB themselves if the guarding predicate is dropped as a result of vectorization because they will be executed for iterations of the loop that wouldn't be executed in the original scalar loop. For instance, in the motivating example, a sub instruction will produce a poison value itself if its guarding predicate is dropped, regardless of whether the following gep/load using the result of the sub is widened or kept scalar. There are passes that will optimize away poison-generating instructions (i.e., this sub), leading to UB. For example, InstructionSimplify performs some of these optimizations (I think it turns this sub into a 0). Note that we are not blindly dropping all the poison-generating flags from all the instructions. We make sure that we only drop these flags from instructions that were originally guarded by a predicate that has been dropped as a result of vectorization. If those instructions are scalarized (i.e., the predicate is preserved), only interleaved, etc., we are not dropping their flags.

Hopefully, that helps! Please, let me know if you have any other comments.

Thanks,
Diego

The second point is that it seems the patch drops attributes from all hoisted instructions, but that's not strictly needed. You only need to drop attributes from instructions that contribute to instructions that are widened and that produce UB if given poison. I don't think LLVM can produce e.g. a division of a vector by a scalar (if the loop always divides by the same value).

LV can produce a division of a vector by a scalar by broadcasting the scalar into a vector. Is that what you mean?

My point was that load/store are special in how they are widened. A vector load of %p is equivalent to load %p, load %p+1, load %p+2, etc.
So internally the load produces the addresses that would have been used in subsequent iterations.
The root of the problem being addressed here is that we only pass the first operand and have the subsequent ones computed internally by the operation. If we hit the case where the 1st iteration is the one masked out, we may be in trouble.
It's a pretty cool bug! :)

Two conclusions from this:

  • If the 1st iteration (of the vectorized bundle) isn't masked out, we are good: no need to drop any attributes. This is because we would hit any UB in the original program anyway.
  • We only need to drop attributes from instructions that flow into the operands of instructions that have the property of internally computing the operands for subsequent sub-operations, like load/store that increment pointers. I don't think LLVM has any operation other than load/store with this property?

This means the fix should be limited to operands of load/store and for when only the 1st iteration is masked out.
Right now it seems that the code will drop attributes from any instruction regardless of whether it flows into a load/store or not. It's way too conservative. I would rather see it fixed properly now than get a promise of a fix in the future (which statistically rarely happens in the LLVM community).

To make things fully correct, there's also the concern that we need to ensure the value of the 1st iteration isn't itself already poison. This is a tricky dance with SCEV and I don't know if the example I posted previously would kick in right now, but could in the future, so at least we could add that as a unit test to make sure we don't regress if SCEV becomes smarter.

I think the key points here are the instructions with the attributes and their guarding predicate. These instructions, *regardless of their uses*, may produce a poison value and UB themselves if the guarding predicate is dropped as a result of vectorization because they will be executed for iterations of the loop that wouldn't be executed in the original scalar loop.

It's totally fine to execute instructions that yield poison. They won't lead to UB unless used.
Instructions that may produce UB themselves can't be hoisted unless predicated, but I hope that's already accounted for.

If the 1st iteration (of the vectorized bundle) isn't masked out, we are good: no need to drop any attributes. This is because we would hit any UB in the original program anyway.
We only need to drop attributes from instructions that flow into the operands of instructions that have the property of internally computing the operands for subsequent sub-operations, like load/store that increment pointers. I don't think LLVM has any operation other than load/store with this property?

I initially focused on fixing only the address-computation cases, but the ongoing discussion made me realize I was oversimplifying the problem and that we should be more conservative and generalize the fix to more cases. What is not clear to me is what a valid case is. Hopefully, you can help me understand. Should dropping poison-generating flags be UB-driven only, or should the flags be semantically correct based on the definition of each poison-generating flag? A few related questions for my understanding:

  • Should a regular vector add (unmasked) keep the nsw/nuw flags if one vector lane X might overflow?
  • What if the result of the vector add is used by a masked vector store which masks out lane X?
  • Should a GEP instruction keep the inbounds flag if after vectorization the computed address is actually out-of-bounds but out-of-bounds elements are masked out by the consumer masked load/store/gather/scatter?
  • What about vector instructions feeding a vector GEP feeding a gather/scatter?
  • What about FMF (https://reviews.llvm.org/D112117)?

I think I need an answer to these questions to really understand what is needed. Hopefully, you can help me with this. It sounds really interesting! :).

The definition of the poison-generating flags seems ambiguous for vector types. If the properties of a flag might not hold within the context of the instruction itself and for all the vector lanes, wouldn’t it be semantically incorrect to keep that flag? Keeping the flag because the user of that instruction will mask out the invalid lanes sounds concerning to me but maybe I’m wrong here.

This means the fix should be limited to operands of load/store and for when only the 1st iteration is masked out.

If we wanted to go this way, we would have to prove that the first vector lane of at least one vector iteration is masked out, not necessarily the first lane of the first vector iteration. It seems complicated to prove that accurately. Any ideas? The guarding condition could be complex and missing just one flag would defeat the purpose of the fix. Maybe we could consider dropping all the flags in instructions involved in address computation. Would that be reasonable?

This is a tricky dance with SCEV and I don't know if the example I posted previously would kick in right now, but could in the future, so at least we could add that as a unit test to make sure we don't regress if SCEV becomes smarter.

LV bails out because the first phi node is not supported by the vectorizer. I can definitely add a test for this case.
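
For reference, a hand-written IR version of Nuno's earlier example that such a test could start from (function signature, types, and the stored use are assumptions, not taken from the bug report):

define void @poison_phi_addr(float* %p, float* %out, i64 %n) {
entry:
  br label %loop

loop:
  %i = phi i64 [ poison, %entry ], [ %i3, %latch ]
  %i2 = phi i64 [ 0, %entry ], [ %i3, %latch ]
  %c = icmp sgt i64 %i2, 0
  br i1 %c, label %if.then, label %latch

if.then:
  ; %i is poison on the first iteration, but this block is not reached then,
  ; so the scalar loop never uses the poison address.
  %addr = getelementptr inbounds float, float* %p, i64 %i
  %v = load float, float* %addr
  %out.addr = getelementptr inbounds float, float* %out, i64 %i2
  store float %v, float* %out.addr
  br label %latch

latch:
  %i3 = add nuw nsw i64 %i2, 1
  %cont = icmp slt i64 %i3, %n
  br i1 %cont, label %loop, label %exit

exit:
  ret void
}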

It would be great to know what the other reviewers think about this!

Thanks,
Diego

  • Should a regular vector add (unmasked) keep the nsw/nuw flags if one vector lane X might overflow?

Yes. The result won't be used if it wasn't used by the original program (except for widened loads/stores, as discussed before)

  • What if the result of the vector add is used by a masked vector store which masks out lane X?

No problem. A masked store is equivalent to:

if (mask[i])
  store v[i], p[i]

So v[i] can be poison because it won't be used.

  • Should a GEP instruction keep the inbounds flag if after vectorization the computed address is actually out-of-bounds but out-of-bounds elements are masked out by the consumer masked load/store/gather/scatter?

Yep, same as the previous. But as long as the address is not used for a widened load/store.

  • What about vector instructions feeding a vector GEP feeding a gather/scatter?

That depends. Was the original program using that same value? If so, no problem. Otherwise I need a concrete example as I can't imagine one right away :)

The problem with floats would be signaling NaNs, I guess. It wouldn't be correct to execute those speculatively. But that's a totally different problem.
If it's just about FMF producing poison, then no worries and that patch doesn't seem necessary (because if something is broken, it's a much bigger problem).

The definition of the poison-generating flags seems ambiguous for vector types. If the properties of a flag might not hold within the context of the instruction itself and for all the vector lanes, wouldn’t it be semantically incorrect to keep that flag? Keeping the flag because the user of that instruction will mask out the invalid lanes sounds concerning to me but maybe I’m wrong here.

It's totally fine to execute instructions that produce poison as long as their value is not used. If you use them in a masked_store, the masked out values are not used.
This is correct:

if (i != 1)
  r[i] = a[i] +nsw b[i]
=>
tmp = a[i..i+3] +nsw b[i..i+3]
store tmp, r, <1,0,1,1>

Sorry, I don't know the exact syntax, but the point is that you can execute the add for i == 1 with nsw as you'll never use the result.
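
In actual IR syntax, the sketch above corresponds roughly to the following (hand-written illustration; VF=4 and i32 elements are assumed). The second mask bit is false, matching the i != 1 guard, so the possibly-poison lane of %tmp is never stored:

define void @masked_store_keeps_nsw(<4 x i32> %a, <4 x i32> %b, <4 x i32>* %r) {
  ; Lane 1 of %tmp may be poison if that lane overflows, but its mask bit
  ; below is false, so the masked store never uses it.
  %tmp = add nsw <4 x i32> %a, %b
  call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp, <4 x i32>* %r, i32 4, <4 x i1> <i1 true, i1 false, i1 true, i1 true>)
  ret void
}

declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)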

This means the fix should be limited to operands of load/store and for when only the 1st iteration is masked out.

If we wanted to go this way, we would have to prove that the first vector lane of at least one vector iteration is masked out, not necessarily the first lane of the first vector iteration. It seems complicated to prove that accurately. Any ideas? The guarding condition could be complex and missing just one flag would defeat the purpose of the fix. Maybe we could consider dropping all the flags in instructions involved in address computation. Would that be reasonable?

If by that you mean dropping flags from instructions whose values flow into addresses of widened load/store operations, yes! That's a good way of fixing the 1st issue.
Unless you know something about the mask. If the 1st lane of every iteration is not masked out, you don't need to do anything as that value would be dereferenced in the original program.

This is a tricky dance with SCEV and I don't know if the example I posted previously would kick in right now, but could in the future, so at least we could add that as a unit test to make sure we don't regress if SCEV becomes smarter.

LV bails out because the first phi node is not supported by the vectorizer. I can definitely add a test for this case.

That's great! I was worried your bug report exposes 2 separate problems. I just didn't know if the 2nd one happened in practice or not.

I'm happy to answer further questions, of course. If you have concrete examples in mind even better as it's easier to communicate I think.

Thanks a lot for the explanation and the quick response...

  • Should a GEP instruction keep the inbounds flag if after vectorization the computed address is actually out-of-bounds but out-of-bounds elements are masked out by the consumer masked load/store/gather/scatter?

Yep, same as the previous. But as long as the address is not used for a widened load/store.

Let me clarify... The GEP will feed a masked load/store. We won't load the data in that out-of-bounds address (masked out) but that address will be used as base for the masked load/store. It sounds like a case similar to the one exposing the bug. Since the address will be used, I understand we should drop the inbounds.

What about vector instructions feeding a vector GEP feeding a gather/scatter?

That depends. Was the original program using that same value? If so, no problem. Otherwise I need a concrete example as I can't imagine one right away :)

Gathers/scatters are modeled as a vector of pointers using a vector GEP so if an address is poison it will be masked out, at least, initially. Unfortunately, some backends will turn this vector of pointers into a single base pointer + offsets. If the base pointer is poison, we have the same issue as with masked loads/stores. This is basically a vector variant of that problem. I would suggest dropping the flags also here, following the same logic. Does it make sense?
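
For reference, a hand-written sketch of such a gather (element type, names, and alignment are assumptions): the vector GEP yields one pointer per lane, and the gather itself never dereferences masked-out lanes, but a backend that rewrites it as a single base pointer plus offsets does use one lane's pointer as the base.

define <4 x float> @gather_sketch(float* %base, <4 x i64> %idx, <4 x i1> %mask) {
  ; One pointer per lane; a poison pointer in a masked-out lane is fine for
  ; the gather itself, but not for a backend that picks it as the base.
  %ptrs = getelementptr inbounds float, float* %base, <4 x i64> %idx
  %g = call <4 x float> @llvm.masked.gather.v4f32.v4p0f32(<4 x float*> %ptrs, i32 4, <4 x i1> %mask, <4 x float> undef)
  ret <4 x float> %g
}

declare <4 x float> @llvm.masked.gather.v4f32.v4p0f32(<4 x float*>, i32 immarg, <4 x i1>, <4 x float>)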

The problem with floats would be signaling NaNs, I guess. It wouldn't be correct to execute those speculatively. But that's a totally different problem.
If it's just about FMF producing poison, then no worries and that patch doesn't seem necessary (because if something is broken, it's a much bigger problem).

Ok.

Maybe we could consider dropping all the flags in instructions involved in address computation. Would that be reasonable?

If by that you mean dropping flags from instructions whose values flow into addresses of widened load/store operations, yes! That's a good way of fixing the 1st issue.
Unless you know something about the mask. If the 1st lane of every iteration is not masked out, you don't need to do anything as that value would be dereferenced in the original program.

Ok, I can add a check that looks for a widen load/store in the def-use chain.

Thanks!

Thanks a lot for the explanation and the quick response...

  • Should a GEP instruction keep the inbounds flag if after vectorization the computed address is actually out-of-bounds but out-of-bounds elements are masked out by the consumer masked load/store/gather/scatter?

Yep, same as the previous. But as long as the address is not used for a widened load/store.

Let me clarify... The GEP will feed a masked load/store. We won't load the data in that out-of-bounds address (masked out) but that address will be used as base for the masked load/store. It sounds like a case similar to the one exposing the bug. Since the address will be used, I understand we should drop the inbounds.

A masked load of a masked-out lane is essentially a NOP, so the address can be poison as it's not actually used. You don't need to drop inbounds.
LangRef doesn't even require the address of masked-out lanes to be aligned, unlike, e.g., memcpy, where even with size=0 the address must be properly aligned.

What about vector instructions feeding a vector GEP feeding a gather/scatter?

That depends. Was the original program using that same value? If so, no problem. Otherwise I need a concrete example as I can't imagine one right away :)

Gathers/scatters are modeled as a vector of pointers using a vector GEP so if an address is poison it will be masked out, at least, initially. Unfortunately, some backends will turn this vector of pointers into a single base pointer + offsets. If the base pointer is poison, we have the same issue as with masked loads/stores. This is basically a vector variant of that problem. I would suggest dropping the flags also here, following the same logic. Does it make sense?

Sounds like the same problem, yes. But just dropping flags isn't enough, because similarly the base pointer could have been poison already.
Those backends are buggy. Fixing this requires either proving that the base pointer isn't poison in the first place, or deriving a base address from a non-masked-out lane (probably the easiest solution?).

Let me clarify... The GEP will feed a masked load/store. We won't load the data in that out-of-bounds address (masked out) but that address will be used as base for the masked load/store. It sounds like a case similar to the one exposing the bug. Since the address will be used, I understand we should drop the inbounds.

A masked load of a masked-out lane is essentially a NOP, so the address can be poison as it's not actually used. You don't need to drop inbounds.
LangRef doesn't even require the address of masked-out lanes to be aligned, unlike, e.g., memcpy, where even with size=0 the address must be properly aligned.

Ok, let me update the patch so that we can discuss actual examples. I have the impression that we both agree but might be looking at it from different perspectives.

Gathers/scatters are modeled as a vector of pointers using a vector GEP so if an address is poison it will be masked out, at least, initially. Unfortunately, some backends will turn this vector of pointers into a single base pointer + offsets. If the base pointer is poison, we have the same issue as with masked loads/stores. This is basically a vector variant of that problem. I would suggest dropping the flags also here, following the same logic. Does it make sense?

Sounds like the same problem, yes. But just dropping flags isn't enough, because similarly the base pointer could have been poison already.
Those backends are buggy. Fixing this requires either proving that the base pointer isn't poison in the first place, or deriving a base address from a non-masked-out lane (probably the easiest solution?).

Deriving a base address from a non-masked out lane makes sense but we don't always have that information at compile time. In any case, this patch is a good starting point to fix the backend problems. We can iterate on it later as we see fit.

dcaballe updated this revision to Diff 383687. Sun, Oct 31, 3:01 PM
  • Added check to make sure we only drop poison-generating flags from instructions contributing to the address computation of masked loads/stores.
  • Removed logic to drop flags from widen GEPs (for gathers/scatters)
  • Removed logic to drop flags from all the widen instructions.
  • Reverted changes in impacted tests.
nlopes added a comment. Wed, Nov 3, 6:52 AM

Looks good to me. Added a couple of suggestions to the code.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3144

Isn't it inefficient to recompute this information for every operation? You get an O(n^2) behavior (up to n calls of this function, each traversing up to n instructions).

3151

An address computation may depend on float operations, e.g., a float-to-int conversion, a select based on a float comparison, etc.

3166

You don't need to consider users of loads (nor of stores). You can break here, if you want, when WidenMemRec != nullptr.

dcaballe updated this revision to Diff 384745. Thu, Nov 4, 7:17 AM
dcaballe marked 2 inline comments as done.

Addressed feedback.
Changed the approach to improve the complexity of detecting the
poison-generating recipes. We now gather these recipes before
executing the VPlan, starting from the recipes generating a widen
load/store and traversing the backward slice from their address
operand. In this way, recipes are only visited once.

Please, let me know if there is any other comment.

Thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3144

I changed the approach so that we gather the target recipes before executing the VPlan.

3151

True, removed that check.

3166

It makes sense. Done. Updating doc accordingly.

nlopes accepted this revision. Thu, Nov 4, 12:26 PM

looks correct to me. Left just one perf suggestion.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1195

This visited set can be hoisted outside the lambda. You should only traverse each instruction at most once, as you need to drop poison flags if the instruction contributes to any address (it doesn't matter which).

fhahn added a comment. Sun, Nov 7, 1:16 AM

Thanks for the update! I think the description/title would need a minor update, as they still talk about 'integer' poison generating flags.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
889

This is only used during codegen, right? In that case, it would probably be better to just add it to VPTransformState instead. Otherwise it just adds a little bit of additional coupling between ILV and codegen, making it harder to separate in the future.

1219

Is there a way to avoid querying information from legal here?

As we are interested in any operation contributing to an address computation of a masked memory operation, could we only seed the worklist with addresses of masked operations instead (both VPWidenMemoryInstructionRecipe & VPInterleaveRecipe have hasMask/getMask helpers)? Then we would not need to check legal here, right?

Checking Instr's location here might not be sufficient in any case, I think. There could be cases where Instr is in a block that does not need predication but feeds a memory instruction in a block that needs predication. In those cases, I think we should also drop the flags from Instr. In any case, it might be good to add such a test case.

1225

Do we in some cases also need to drop flags from recipes other than VPReplicateRecipe?

If we need to drop flags from recipes in unpredicated blocks (see comment above), then we could have cases where a vector operation (e.g. VPWidenRecipe) feeds the address computation (the first lane will get extracted). An example could be

define void @drop_vector(float* noalias nocapture readonly %input,
                                 float* %output, i64* noalias %ints) local_unnamed_addr #0 {
entry:
  br label %loop.header

loop.header:
  %iv = phi i64 [ 0, %entry ], [ %iv.inc, %if.end ]
  %i23 = icmp eq i64 %iv, 0
  %gep = getelementptr inbounds i64, i64* %ints, i64 %iv
  %lv = load i64, i64* %gep
  %i27 = sub nuw nsw i64 %iv, 1
  %r = add i64 %lv, %i27
  store i64 %r, i64* %gep
  br i1 %i23, label %if.end, label %if.then

if.then:
  %i29 = getelementptr inbounds float, float* %input, i64 %i27
  %i30 = load float, float* %i29, align 4, !invariant.load !0
  br label %if.end

if.end:
  %i34 = phi float [ 0.000000e+00, %loop.header ], [ %i30, %if.then ]
  %i35 = getelementptr inbounds float, float* %output, i64 %iv
  store float %i34, float* %i35, align 4
  %iv.inc = add nuw nsw i64 %iv, 1
  %exitcond = icmp eq i64 %iv.inc, 4
  br i1 %exitcond, label %loop.exit, label %loop.header

loop.exit:
  ret void
}
attributes #0 = { noinline nounwind uwtable "target-features"="+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl" }

!0 = !{}

Thanks for the feedback, Florian! I added some comments. Working on some of the changes while waiting for some clarifications. Thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
889

Sure, will do!

1195

It makes sense. Will do.

1219

As we are interested in any operation contributing to an address computation of a masked memory operation, could we only seed the worklist with addresses of masked operations instead (both VPWidenMemoryInstructionRecipe & VPInterleaveRecipe have hasMask/getMask helpers)? Then we would not need to check legal here, right?

I had tried using getMask already. Unfortunately, there are cases where the loads are predicated but the mask has been dropped in favor of executing them speculatively unmasked. See https://reviews.llvm.org/D66688 and test load-deref-pred.ll. We should drop the flags from these cases as well and the only way I found is using Legal->blockNeedsPredication. I know it's not ideal but Legal is still widely used during codegen.

Checking Instr's location here might not be sufficient in any case, I think. There could be cases where Instr is in a block that does not need predication but feeds a memory instruction in a block that needs predication. In those cases, I think we should also drop the flags from Instr. In any case, it might be good to add such a test case.

I think that makes sense. It would be a case similar to the one that exposed the bug, but with the sub instruction moved outside the condition. I'll give it a try!

1225

I assumed that the address computation would be uniform for those cases and left scalar; wouldn't it be? I'll give it a try, but there has been some back and forth wrt dropping the flags from vector instructions already. Based on previous feedback, I'm not sure if we should drop the flags from them when there are lanes that won't generate poison. I would need some clarification before proceeding with those changes (@nlopes).

dcaballe added inline comments. Sun, Nov 7, 3:00 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1225

Looking closely at your example, the poison value generated by 'sub' in the first iteration of the loop is actually used in the scalar version of the code (it is stored). For that reason, my impression is that having the nuw flag in the input sub would be incorrect, isn't that the case?

dcaballe updated this revision to Diff 385340. Sun, Nov 7, 6:32 AM
  • Added support for non-predicated poison-generating instruction cases.
  • Move MayGeneratePoisonRecipes to VPTransformState.
  • Update documentation.
dcaballe updated this revision to Diff 385341. Sun, Nov 7, 6:43 AM
dcaballe marked 3 inline comments as done.

Moved visited set outside of lambda function

dcaballe added inline comments. Sun, Nov 7, 6:46 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1219

getMask can't be used for the interleave recipe either. There are interleave groups that are formed out of non-predicated loads/stores. However, the resulting interleave group may still require a mask if the to-be-generated widened load reads more elements than those needed by the group. We should not drop poison flags in those cases.

dcaballe edited the summary of this revision. Sun, Nov 7, 7:58 AM
nlopes added inline comments. Sun, Nov 7, 12:00 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1225

I mentioned this issue before: what happens if a value was already poison in the first place?

I suggest you commit this patch first, as it addresses half of the problem, and then we can discuss the best way to fix the second part. Even dropping flags from all instructions within the function isn't sufficient, as you may get a poison value as input (change the example above to have [ %arg, %entry ] rather than [ 0, %entry ] for %iv). Though you only need to patch the code for the cases where you cannot prove the code would execute in the scalar version. The right fix isn't trivial.

dcaballe updated this revision to Diff 385436. Mon, Nov 8, 3:08 AM

Fixed failing test

Ok, I'll wait for Florian's approval to land this patch.

Thanks!
Diego

fhahn added inline comments. Mon, Nov 8, 1:49 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1219

I had tried using getMask already. Unfortunately, there are cases where the loads are predicated but the mask has been dropped in favor of executing them speculatively unmasked. See https://reviews.llvm.org/D66688 and test load-deref-pred.ll. We should drop the flags from these cases as well and the only way I found is using Legal->blockNeedsPredication. I know it's not ideal but Legal is still widely used during codegen.

Hmm, that's unfortunate! There are still plenty of uses of Legal, but each new use we add makes things more difficult to transition. Let me take a look if there's another alternative, but if there's none it's not the end of the world for now.

1225

I tried reading the discussion again, but I'm not sure why it would matter for the test case whether the sub gets widened to a vector or not. In the test case, the issue should still be the same as in the motivating test case: the flags on the instruction can poison (some) vector lanes, independent of whether the inputs were poison originally.

In the example, the sub nuw nsw gets widened, but similar to the other test cases, the first lane is used in the address computation of the masked memory operation. Effectively I am not sure I see the difference whether we compute only the first lane (because we scalarized the sub) or if we compute all lanes and use the first lane only in a UB-on-poison op and have other non-UB-on-poison uses of the full vector (storing a vector with (some) poison lanes should not be UB).

If we only restrict this to VPReplicateRecipe, it seems like we can easily run into slight variations of the motivating test case where we still have the same issue as the patch is trying to fix.

Now, if we support dropping flags from non-BB-local instructions/recipes, there are cases where we may not need to drop flags, e.g. because we can prove that poison generated by the instruction causes UB unconditionally. In those cases, I think for now our priority should be correctness, as it's unlikely for poison-generating flags to make a notable difference for the vector loops during codegen.

1249

I think for consistency with the LLVM style used here variables should start with an upper case letter (https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly)

llvm/test/Transforms/LoopVectorize/X86/drop-poison-generating-flags.ll
82–83

Should nuw nsw be retained in the source here, and the same for inbounds? It looks like they got dropped.

dcaballe added inline comments. Tue, Nov 9, 2:33 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1219

Sure! Note that blockNeedsPredication is used in quite a few places so all of them should be addressed in the same way.

1225

The semantics of poison-generating flags on vector instructions are not clear to me, even after the discussion. I added support for those cases initially, then removed it... It should be easy to add them again, but I want to make sure we agree on that before introducing any changes. If those cases need further discussion, I'll be happy to address them in a separate commit. I'm not too concerned about those cases since I don't think LLVM can replace the value of a specific lane with a poison value at this point.

Now, if we support dropping flags from non-BB-local instructions/recipes, there are cases where we may not need to drop flags, e.g. because we can prove that poison generated by the instruction causes UB unconditionally. In those cases, I think for now our priority should be correctness, as it's unlikely for poison-generating flags to make a notable difference for the vector loops during codegen.

I'm not sure I fully understand this part, but the key point, as discussed previously, is the usage of the poison value. We may have an instruction that generates a poison value unconditionally. It should be ok as long as the potential poison value is not used. We should drop the flags if the potential poison value happens to be used after vectorization. That's exactly what the latest implementation is doing.

llvm/test/Transforms/LoopVectorize/X86/drop-poison-generating-flags.ll
82–83

This is the test you asked me to add, isn't it?

Checking Instr's location here might not be sufficient in any case, I think. There could be cases where Instr is in a block that does not need predication but feeds a memory instruction in a block that needs predication. In those cases, I think we should also drop the flags from Instr. In any case, it might be good to add such a test case.

sub and getelementptr are not predicated but feed a load that is predicated.

dcaballe updated this revision to Diff 385838. Tue, Nov 9, 8:27 AM

Fixed naming convention issues

dcaballe updated this revision to Diff 385902. Tue, Nov 9, 11:01 AM

Since it was already implemented in a previous version of the code, I restored the logic to drop flags
from Widen and WidenGEP recipes, following Florian's suggestion. These changes only impact Florian's
test case. I modified that test case to also have a vector GEP feeding a masked load, so that use case
is covered as well.

Let me know if you have any further comments.
Thanks

Any other comments? :)

Thanks for the update! I left a few small remaining comments. Basically LGTM, but it would still be good to hear @nlopes' thoughts on the recent responses with respect to widened instructions.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1213

I think it would be helpful to add a comment spelling out what we are looking for here.

1216

I'm not entirely sure why we need to check the parent region/replicator? It would be great if you could include the reasoning in the comment above, or remove if it is not needed.

1231

nit: it might be more explicit to pass the plan as argument instead of accessing it from State.

dcaballe updated this revision to Diff 387402. Mon, Nov 15, 2:13 PM
dcaballe marked 2 inline comments as done.
  • Updated and added more comments.
  • Simplified redundant condition.

Thanks, Florian! I addressed the feedback.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1216

Good catch! Actually, most of these conditions are redundant or irrelevant in the new algorithm since we already know this recipe is contributing to the address computation of a VPWidenMemoryRecipe or VPInterleaveRecipe. That simplifies the condition a lot. Thanks!

1231

We are also using other fields from State here, like VF and MayGeneratePoisonRecipes, and this is the only use of Plan. I'm not sure it's worth it.

I talked to Nuno in private and he mentioned that I could go ahead and commit the changes and address any minor feedback in a separate commit since he is very busy right now.
I'll commit this on Monday if no other comments.

Thank you all for the feedback!
Diego

fhahn accepted this revision. Sun, Nov 21, 1:58 PM

I talked to Nuno in private and he mentioned that I could go ahead and commit the changes and address any minor feedback in a separate commit since he is very busy right now.
I'll commit this on Monday if no other comments.

Thank you all for the feedback!
Diego

Sounds good to me! LGTM, thanks.

This patch already stretches my knowledge of the vectorizer code a bit, but the examples look good.
Thanks @fhahn for the test case. And apologies for the delay in getting back to this; I had a deadline Friday and was quite busy.

We should fuzz this thing. I'm still not super comfortable. I have a few theoretical concerns, but I don't know if they trigger in practice.

Thanks for all the feedback! I think this will cover a good part of the cases that could actually go wrong in practice. We can accommodate more cases as we see fit.