This is an archive of the discontinued LLVM Phabricator instance.

[LV] Drop integer poison-generating flags from instructions that need predication
ClosedPublic

Authored by dcaballe on Oct 14 2021, 3:12 PM.

Details

Summary

This patch fixes PR52111. The problem is that LV propagates poison-generating flags (nuw/nsw, exact
and inbounds) on instructions that contribute to the address computation of widened loads/stores that are
guarded by a condition. When the code is vectorized and the control flow within the loop is linearized,
these flags may lead to generating a poison value that is effectively used as the base address
of the widened load/store. For example, in the following loop:

loop2.header:
  %iv2 = phi i64 [ 0, %loop1.header ], [ %iv2.inc, %if.end ]
  %i23 = icmp eq i64 %iv2, 0
  br i1 %i23, label %if.end, label %if.then

if.then:
  %i27 = sub nuw nsw i64 %iv2, 1
  %i29 = getelementptr inbounds [1 x [33 x float]], [1 x [33 x float]]* %input, i64 0, i64 %iv1, i64 %i27
...

'%i27' is a uniform operation related to address computation and it's guarded by condition
'%i23' (i.e., %iv2 > 0). Its nuw flag is only valid under this condition, since the condition ensures
that the value of '%iv2' reaching '%i27' will always be greater than zero and, consequently,
the result of '%i27' will always be non-negative. However, after vectorizing and linearizing the
control flow, '%i27' (uniform, address computation) will also be executed for '%iv2 = 0', where
the subtraction wraps and the nuw flag becomes invalid.
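To illustrate, the linearized vector body ends up looking roughly like the following. This is a hand-written sketch, not actual vectorizer output; the vector width and the value names (%mask, %vec.iv2, %addr, %wide.load) are hypothetical:

```llvm
vector.body:
  ; The guard '%i23' is no longer a branch; it becomes a per-lane mask.
  %mask = icmp ne <4 x i64> %vec.iv2, zeroinitializer
  ; Executed unconditionally now. For the iteration where %iv2 == 0 the
  ; subtraction wraps, so with 'nuw' the result is poison.
  %i27 = sub nuw nsw i64 %iv2, 1
  %i29 = getelementptr inbounds [1 x [33 x float]], [1 x [33 x float]]* %input, i64 0, i64 %iv1, i64 %i27
  %addr = bitcast float* %i29 to <4 x float>*
  ; The possibly-poison address is used as the base of the widened load,
  ; even though the offending lane is masked out.
  %wide.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %addr, i32 4, <4 x i1> %mask, <4 x float> poison)
```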

The fix drops all the integer poison-generating flags from instructions that contribute to the
address computation of a widened load/store whose original instruction was in a basic block
that needed predication and is not predicated after vectorization.

Diff Detail

Event Timeline

dcaballe created this revision.Oct 14 2021, 3:12 PM
dcaballe requested review of this revision.Oct 14 2021, 3:12 PM
Herald added a project: Restricted Project.Oct 14 2021, 3:12 PM

Thanks @dcaballe for the patch. A bit surprising that we didn't notice this earlier but I guess masking does not get used that often.

I think it makes sense to get rid of nuw and nsw if we are explicitly scalarising something out of its (scalar) predicate.

I understand that alternatives like extending LoopVectorizationCostModel::isScalarWithPredication or VPRecipeBuilder::handleReplication would be worse. They'd cause the loop to be costed higher or to have an extra branch inside the vector loop that we do not seem to need, right?

lebedev.ri added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3158

What about exact on division?

fhahn added a comment.Oct 18 2021, 9:25 AM

Thanks for the patch! Some comments inline.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3156

Def here should always be a VPReplicateRecipe I think, so you should be able to use cast<> instead. Or maybe even better, update the function signature to pass a single VPReplicateRecipe reference instead of both VPValue *Def and VPUser &User.

llvm/test/Transforms/LoopVectorize/pr52111.ll
3 ↗(On Diff #379861)

Can you pre-commit the test?

8 ↗(On Diff #379861)

It might be worth matching a bit more context here, e.g. a full triangle inside the vector body.

13 ↗(On Diff #379861)

This should not be required. If it is, the test would need to be moved to the X86 test directory.

16 ↗(On Diff #379861)

could we just use float * instead of more nested types to make the test a bit simpler?

21 ↗(On Diff #379861)

Would a single loop suffice or is a nested loop needed?

53 ↗(On Diff #379861)

Is this needed?

57 ↗(On Diff #379861)

Is all that metadata needed? It might be better to use the -force-vector-width=X option instead of metadata, as the vectorization factor is then more obvious from the run line directly.

lebedev.ri added inline comments.Oct 18 2021, 9:28 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3158

(... and other FP fast-math flags)

Thank you all for the quick response!

A bit surprising that we didn't notice this earlier but I guess masking does not get used that often.

Interestingly, we only observed this issue with AVX512. It's very likely that the cost model heavily penalizes masking for some ISAs, so this scenario wouldn't happen that often.

I understand that alternatives like extending LoopVectorizationCostModel::isScalarWithPredication or VPRecipeBuilder::handleReplication would be worse. They'd entail the loop to be costed higher or have an extra branch inside the vector loop that we do not seem to need, right?

Yeah, I don't think we should penalize or change the vectorization strategy for these cases since the nuw/nsw flags are actually not impacting the performance of the generated vector code.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3158

Good catch, thanks! I can't think of an example that applies to FP fast-math flags. The problem happens when an instruction in a predicated block is scalarized without predicate. I can only think of address computation cases that would fall into that category. Do you have any other case in mind?

I'll try to write a test for the division case and I can follow up on any other cases we find separately. This is really blocking internal work. Does it sound reasonable?

llvm/test/Transforms/LoopVectorize/pr52111.ll
3 ↗(On Diff #379861)

Do you want me to commit it before the fix and mark it as failing?

53 ↗(On Diff #379861)

The problem is that this loop was only vectorized when targeting AVX512. The loop might be vectorized differently if I remove some of these flags. Let me try to simplify all of this and see what happens.

dcaballe updated this revision to Diff 380589.Oct 18 2021, 11:13 PM
dcaballe marked 10 inline comments as done.

Addressed feedback.
When creating a test case for the division instruction I realized that the problem
could also happen for vectorized instructions. For example, the address computation
of a memory access would also be vectorized if the access is a gather/scatter. I
added support for those cases to the VPWidenRecipe.

Can you explain why this isn't applicable to FP fast-math flags also?

Can you explain why this isn't applicable to FP fast-math flags also?

The use cases that I've seen naturally occurring are related to memory address computation. For those cases, I think FP instructions wouldn't make sense, unless we create some artificial IR with address computation in FP that is then converted to integers. For more generic cases, of course, we could create artificial/non-optimized IR that would expose the problem on FP, but I don't see how that IR would make its way to LV with the current pipelines. If you can provide a test case, I'll be happy to address it in a separate patch.

I don't think we should block this patch until we cover all the potential cases, because the problem could be more complex than we know right now. For example, linearization of the control flow might also impact __builtin_assume (not sure if we are preserving those after vectorization) and other metadata guarded by a predicate. These generic cases can be addressed incrementally by adding the corresponding logic to clearUnsafeFlagsAfterLinearizingCF as we see fit.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3156

Yeah, it makes sense. Thanks!

llvm/test/Transforms/LoopVectorize/pr52111.ll
8 ↗(On Diff #379861)

I'm matching more instructions now, but note that there is no triangle (if I understand you correctly) since the control flow is linearized.

13 ↗(On Diff #379861)

It is, since there's no flag to force masked vectorization. I need to target AVX512; otherwise, the loop is vectorized with predicates, which doesn't expose the problem.

53 ↗(On Diff #379861)

I only left the avx512 metadata

57 ↗(On Diff #379861)

Removed all the metadata

I've been playing with this further and I can't find a way to expose the problem with FP. If anybody could provide a test case, that would be helpful, or maybe any idea that I can try. All my attempts end up being simplified by passes before LV.

I've been playing with this further and I can't find a way to expose the problem with FP. If anybody could provide a test case, that would be helpful, or maybe any idea that I can try. All my attempts end up being simplified by passes before LV.

I'm not quite sure it is reasonable to rely on other passes to prevent correctness issues in some other pass.
Does it not reproduce with a manually-written IR that only runs LV pass?

dcaballe retitled this revision from [LV] Drop NUW/NSW flags from scalarized instructions that need predication to [LV] Drop NUW/NSW flags from instructions that need predication.Oct 19 2021, 6:29 PM

I created D112117 to address the FP cases. It already implements the logic needed to drop some FP flags. I'm not even 100% sure about which FP flags should be dropped after predication. Definitely not all of them. Let's have this discussion in that code review. The NUW/NSW correctness issue is blocking internal work. Hopefully, this patch can move forward while we better understand what is needed for the FP cases.

Thanks,
Diego

spatel added inline comments.Oct 20 2021, 7:45 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3131

This could be Instruction::dropPoisonGeneratingFlags() instead of new code? That also handles GEP's inbounds flag. If inbounds is also a concern with this transform, can we add a test for it?

Note that dropPoisonGeneratingFlags has a TODO comment for FMF.

dcaballe added inline comments.Oct 20 2021, 10:16 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3131

That also handles GEP's inbounds flag. If inbounds is also a concern with this transform, can we add a test for it?

Thanks! That makes sense, I think. The load/store will be inbounds because we are masking out the out-of-bounds elements, but the GEP could be out-of-bounds.

This could be Instruction::dropPoisonGeneratingFlags() instead of new code?

Sure. The only problem I see is extending this utility to FMF. The utility is broadly used in LLVM (ValueTracking, LoopUtils, SimplifyIndVar, InstCombine*, BDCE, etc.). It might be hard to add FMF if that breaks all these passes, but we can try.

spatel added inline comments.Oct 20 2021, 12:52 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3131

I think it's low risk - those passes are mostly dealing with integer transforms when using that call.
Even if we change something by including FMF (nnan and ninf) in the list, it should be a perf regression at worst (we shouldn't break correctness by removing FMF or wrap/exact flags).

dcaballe updated this revision to Diff 381129.Oct 20 2021, 6:49 PM

Addressed the feedback:

  • Replaced 'clearUnsafeFlagsAfterLinearizingCF' with 'Instruction::dropPoisonGeneratingFlags'.
  • Added logic to drop 'inbounds' from a VPWidenGEPRecipe.
  • Updated all the tests impacted by the new changes (quite a few).

Please, let me know if anything else is missing.
Thanks!

dcaballe retitled this revision from [LV] Drop NUW/NSW flags from instructions that need predication to [LV] Drop integer poison-generating flags from instructions that need predication.Oct 20 2021, 6:53 PM
dcaballe edited the summary of this revision. (Show Details)
dcaballe updated this revision to Diff 381371.Oct 21 2021, 1:12 PM

Fixed RISCV test

spatel accepted this revision.Oct 22 2021, 6:57 AM

LGTM - the use of the standard dropPoisonGeneratingFlags is good, and the code comment about inbounds seems right. However, I'm not familiar enough with LV (and these tests specifically) to know if there's more or less we should or could do, so please wait for at least one more reviewer to sign off.

This revision is now accepted and ready to land.Oct 22 2021, 6:57 AM

Thanks, Sanjay! Anything else, Florian/Roger/Roman?

lebedev.ri added a subscriber: nlopes.

I think this is right, but just to be extra sure, let's ask @nlopes ?

nlopes added a comment.EditedOct 27 2021, 9:24 AM

I don't know the code of the loop vectorizer, but I have two concerns.

  1. Correctness:

The underlying issue of the bug report is that we widen a load to a masked vector load which takes only the address of the first iteration. We need to ensure this address is not poison if it wouldn't be poison for the non-masked-out loads.
Dropping poison-producing flags from any operation hoisted out of conditional BBs that contributes to the address is a good step forward. But it doesn't seem enough.
Consider this example:

loop:
  %i = phi [poison, %entry], [%i3, %loop]
  %i2 = phi [0, %entry], [%i3, %loop]
  if (%i2 > 0) {
     %p = gep %p, %i
     load %p
  }
   %i3 = add %i2, 1
   br %cond, loop, ..

Now vectorize that load to a masked load of %p and we are doomed because %p is poison in the first iteration.
Not sure this example would kick in with the vectorizer as it needs to prove that loads are contiguous, but maybe SCEV will take the poison as 0 and vectorization kicks in. Anyway, just to say that just dropping poison-producing attributes may not be enough.

The second point is that it seems the patch drops attributes from all hoisted instructions, but that's not strictly needed. You only need to drop attributes from instructions that contribute to instructions that are widened and that produce UB if given poison. I don't think LLVM can produce e.g. a division of a vector by a scalar (if the loop always divides by the same value).

Thanks for the feedback, Nuno!

Dropping poison-producing flags from any operation that is hoisted of conditional BBs that contribute to the address is a good step forward. But it doesn't seem enough.

That's a great point. One option could be avoiding masking in basic blocks with direct or indirect uses of poison and scalarize that code instead. We could even only scalarize the impacted instructions and apply masking to the rest. I think addressing those cases would require much more involved changes. I agree that the current fix is a good step forward. We can address more complex cases incrementally.

The second point is that it seems the patch drops attributes from all hoisted instructions, but that's not strictly needed. You only need to drop attributes from instructions that contribute to instructions that are widened and that produce UB if given poison. I don't think LLVM can produce e.g. a division of a vector by a scalar (if the loop always divides by the same value).

LV can produce a division of a vector by a scalar by broadcasting the scalar into a vector. Is that what you mean?

I think the key points here are the instructions with the attributes and their guarding predicate. These instructions, *regardless of their uses*, may produce a poison value and UB themselves if the guarding predicate is dropped as a result of vectorization, because they will be executed for iterations of the loop that wouldn't be executed in the original scalar loop. For instance, in the motivating example, a sub instruction will produce a poison value itself if its guarding predicate is dropped, regardless of whether the following gep/load using the result of the sub is widened or kept scalar. There are passes that will optimize away poison-generating instructions (i.e., this sub), leading to UB. For example, InstructionSimplify performs some of these optimizations (I think it turns this sub into a 0).

Note that we are not blindly dropping all the poison-generating flags from all the instructions. We make sure that we only drop these flags from instructions that were originally guarded by a predicate that has been dropped as a result of vectorization. If those instructions are scalarized (i.e., the predicate is preserved), only interleaved, etc., we are not dropping their flags.

Hopefully, that helps! Please, let me know if you have any other comments.

Thanks,
Diego

The second point is that it seems the patch drops attributes from all hoisted instructions, but that's not strictly needed. You only need to drop attributes from instructions that contribute to instructions that are widened and that produce UB if given poison. I don't think LLVM can produce e.g. a division of a vector by a scalar (if the loop always divides by the same value).

LV can produce a division of a vector by a scalar by broadcasting the scalar into a vector. Is that what you mean?

My point was that load/store are special in how they are widened. A vector load of %p is equivalent to load %p, load %p+1, load %p+2, etc.
So internally the load computes the addresses that would have been used in subsequent iterations.
The root of the problem being addressed here is that we only pass the first operand and then have the subsequent ones computed internally by the operation. If we hit the case where the 1st iteration is the one masked out, we may be in trouble.
It's a pretty cool bug! :)
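In IR terms, the widened conditional load receives only the lane-0 address and derives the remaining lanes from it internally. This is an illustrative sketch with made-up names (%p, %mask), assuming a width of 4 and typed pointers:

```llvm
; Scalar loop: each guarded iteration does 'load float, float* %p.i'.
; Vector loop: a single masked load whose only address operand is the
; lane-0 pointer; lanes 1..3 are implicitly %p + 1, %p + 2, %p + 3.
%v = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %p, i32 4,
          <4 x i1> %mask, <4 x float> poison)
```

So even if lane 0 is masked out and its loaded value is never used, a poison %p poisons the address computation for all the other lanes.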

Two conclusions from this:

  • If the 1st iteration (of the vectorized bundle) isn't masked out, we are good: no need to drop any attributes. This is because we would hit any UB in the original program anyway.
  • We only need to drop attributes from instructions that flow into the operands of instructions that have the property of internally computing the operands for subsequent sub-operations, like in load/store that increment pointers. I don't think LLVM has any other operation other than load/store with this property?

This means the fix should be limited to operands of load/store and for when only the 1st iteration is masked out.
Right now it seems that the code will drop attributes from any instruction regardless of whether it flows into a load/store or not. That's way too conservative. I would rather see it fixed properly now than get a promise of a fix in the future (something that statistically rarely happens in the LLVM community).

To make things fully correct, there's also the concern that we need to ensure the value of the 1st iteration isn't itself already poison. This is a tricky dance with SCEV and I don't know if the example I posted previously would kick in right now, but could in the future, so at least we could add that as a unit test to make sure we don't regress if SCEV becomes smarter.

I think the key points here are the instructions with the attributes and their guarding predicate. These instructions, *regardless of their uses*, may produce a poison value and UB themselves if the guarding predicate is dropped as a result of vectorization because they will be executed for iterations of the loop that wouldn't be executed in the original scalar loop.

It's totally fine to execute instructions that yield poison. They won't lead to UB unless used.
Instructions that may produce UB themselves can't be hoisted unless predicated, but I hope that's already accounted for.

If the 1st iteration (of the vectorized bundle) isn't masked out, we are good: no need to drop any attributes. This is because we would hit any UB in the original program anyway.
We only need to drop attributes from instructions that flow into the operands of instructions that have the property of internally computing the operands for subsequent sub-operations, like in load/store that increment pointers. I don't think LLVM has any other operation other than load/store with this property?

I initially focused on fixing only the address computation cases but, with the ongoing discussion, I understood that I was oversimplifying the problem and that we should be more conservative and generalize the fix to more cases. What is not clear to me is what a valid case is. Hopefully, you can help me understand. Should dropping poison-generating flags be UB-driven only, or should the flags be semantically correct based on the definition of each poison-generating flag? A few related questions for my understanding:

  • Should a regular vector add (unmasked) keep the nsw/nuw flags if one vector lane X might overflow?
  • What if the result of the vector add is used by a masked vector store which masks out lane X?
  • Should a GEP instruction keep the inbounds flag if after vectorization the computed address is actually out-of-bounds but out-of-bounds elements are masked out by the consumer masked load/store/gather/scatter?
  • What about vector instructions feeding a vector GEP feeding a gather/scatter?
  • What about FMF (https://reviews.llvm.org/D112117)?

I think I need an answer to these questions to really understand what is needed. Hopefully, you can help me with this. It sounds really interesting! :).

The definition of the poison-generating flags seems ambiguous for vector types. If the properties of a flag might not hold within the context of the instruction itself and for all the vector lanes, wouldn’t it be semantically incorrect to keep that flag? Keeping the flag because the user of that instruction will mask out the invalid lanes sounds concerning to me but maybe I’m wrong here.

This means the fix should be limited to operands of load/store and for when only the 1st iteration is masked out.

If we wanted to go this way, we would have to prove that the first vector lane of at least one vector iteration is masked out, not necessarily the first lane of the first vector iteration. It seems complicated to prove that accurately. Any ideas? The guarding condition could be complex and missing just one flag would defeat the purpose of the fix. Maybe we could consider dropping all the flags in instructions involved in address computation. Would that be reasonable?

This is a tricky dance with SCEV and I don't know if the example I posted previously would kick in right now, but could in the future, so at least we could add that as a unit test to make sure we don't regress if SCEV becomes smarter.

LV bails out because the first phi node is not supported by the vectorizer. I can definitely add a test for this case.

It would be great to know what the other reviewers think about this!

Thanks,
Diego

  • Should a regular vector add (unmasked) keep the nsw/nuw flags if one vector lane X might overflow?

Yes. The result won't be used if it wasn't used by the original program (except for widened loads/stores, as discussed before)

  • What if the result of the vector add is used by a masked vector store which masks out lane X?

No problem. A masked store is equivalent to:

if (mask[i])
  store v[i], p[i]

So v[i] can be poison because it won't be used.

  • Should a GEP instruction keep the inbounds flag if after vectorization the computed address is actually out-of-bounds but out-of-bounds elements are masked out by the consumer masked load/store/gather/scatter?

Yep, same as the previous one. But only as long as the address is not used by a widened load/store.

  • What about vector instructions feeding a vector GEP feeding a gather/scatter?

That depends. Was the original program using that same value? If so, no problem. Otherwise I need a concrete example as I can't imagine one right away :)

The problem with floats is signaling NaNs, I guess. It wouldn't be correct to execute those speculatively. But that's a totally different problem.
If it's just about FMF producing poison, then no worries, and that patch doesn't seem necessary (because if something is broken, it's a much bigger problem).

The definition of the poison-generating flags seems ambiguous for vector types. If the properties of a flag might not hold within the context of the instruction itself and for all the vector lanes, wouldn’t it be semantically incorrect to keep that flag? Keeping the flag because the user of that instruction will mask out the invalid lanes sounds concerning to me but maybe I’m wrong here.

It's totally fine to execute instructions that produce poison as long as their value is not used. If you use them in a masked_store, the masked out values are not used.
This is correct:

if (i != 1)
  r[i] = a[i] +nsw b[i]
=>
tmp = a[i..i+3] +nsw b[i..i+3]
masked_store tmp, r, <1,0,1,1>

Sorry, I don't know the exact syntax, but the point is that you can execute the add for i == 1 with nsw as you'll never use the result.
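For reference, that transformation can be written with the llvm.masked.store intrinsic along these lines. This is a sketch with hypothetical value names; the mangled intrinsic name depends on the concrete types:

```llvm
; The add may overflow (and yield poison) in lane 1, but that lane is
; masked out of the store, so its value is never used.
%tmp = add nsw <4 x i32> %a, %b
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp, <4 x i32>* %r,
          i32 4, <4 x i1> <i1 true, i1 false, i1 true, i1 true>)
```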

This means the fix should be limited to operands of load/store and for when only the 1st iteration is masked out.

If we wanted to go this way, we would have to prove that the first vector lane of at least one vector iteration is masked out, not necessarily the first lane of the first vector iteration. It seems complicated to prove that accurately. Any ideas? The guarding condition could be complex and missing just one flag would defeat the purpose of the fix. Maybe we could consider dropping all the flags in instructions involved in address computation. Would that be reasonable?

If by that you mean dropping flags from instructions whose values flow into addresses of widened load/store operations, yes! That's a good way of fixing the 1st issue.
Unless you know something about the mask. If the 1st lane of every iteration is not masked out, you don't need to do anything as that value would be dereferenced in the original program.

This is a tricky dance with SCEV and I don't know if the example I posted previously would kick in right now, but could in the future, so at least we could add that as a unit test to make sure we don't regress if SCEV becomes smarter.

LV bails out because the first phi node is not supported by the vectorizer. I can definitely add a test for this case.

That's great! I was worried your bug report exposes 2 separate problems. I just didn't know if the 2nd one happened in practice or not.

I'm happy to answer further questions, of course. If you have concrete examples in mind even better as it's easier to communicate I think.

Thanks a lot for the explanation and the quick response...

  • Should a GEP instruction keep the inbounds flag if after vectorization the computed address is actually out-of-bounds but out-of-bounds elements are masked out by the consumer masked load/store/gather/scatter?

Yep, same as the previous. But as long as the address is not used for a widened load/store.

Let me clarify... The GEP will feed a masked load/store. We won't load the data in that out-of-bounds address (masked out) but that address will be used as base for the masked load/store. It sounds like a case similar to the one exposing the bug. Since the address will be used, I understand we should drop the inbounds.

What about vector instructions feeding a vector GEP feeding a gather/scatter?

That depends. Was the original program using that same value? If so, no problem. Otherwise I need a concrete example as I can't imagine one right away :)

Gathers/scatters are modeled as a vector of pointers produced by a vector GEP, so if an address is poison it will be masked out, at least initially. Unfortunately, some backends will turn this vector of pointers into a single base pointer + offsets. If the base pointer is poison, we have the same issue as with masked loads/stores. This is basically a vector variant of that problem. I would suggest dropping the flags here as well, following the same logic. Does that make sense?
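A gather of this shape would look roughly like the following (a sketch; %base, %offsets, and %mask are hypothetical names):

```llvm
; A vector GEP yields one pointer per lane; a poison lane is harmless as
; long as it stays masked out.
%ptrs = getelementptr inbounds float, float* %base, <4 x i64> %offsets
%g = call <4 x float> @llvm.masked.gather.v4f32.v4p0f32(<4 x float*> %ptrs,
          i32 4, <4 x i1> %mask, <4 x float> poison)
; A backend that rewrites %ptrs as 'scalar base + offsets' reintroduces
; the masked-load problem if the lane chosen as the base is poison.
```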

The problem with floats with signaling NaNs I guess. It wouldn't be correct to execute those speculatively. But that's a totally different problem.
If it's just about FMF producing poison, then no worries and that patch doesn't seem necessary (because if something is broken, it's a much bigger problem).

Ok.

Maybe we could consider dropping all the flags in instructions involved in address computation. Would that be reasonable?

If by that you mean dropping flags from instructions whose values flow into addresses of widened load/store operations, yes! That's a good way of fixing the 1st issue.
Unless you know something about the mask. If the 1st lane of every iteration is not masked out, you don't need to do anything as that value would be dereferenced in the original program.

Ok, I can add a check that looks for a widened load/store in the def-use chain.

Thanks!

Thanks a lot for the explanation and the quick response...

  • Should a GEP instruction keep the inbounds flag if after vectorization the computed address is actually out-of-bounds but out-of-bounds elements are masked out by the consumer masked load/store/gather/scatter?

Yep, same as the previous. But as long as the address is not used for a widened load/store.

Let me clarify... The GEP will feed a masked load/store. We won't load the data in that out-of-bounds address (masked out) but that address will be used as base for the masked load/store. It sounds like a case similar to the one exposing the bug. Since the address will be used, I understand we should drop the inbounds.

A masked load of a masked out lane is a NOP essentially, so the address can be poison as it's not actually used. You don't need to drop inbounds.
LangRef doesn't require the address of masked-out lanes to be aligned, unlike e.g. memcpy, where even with size=0 the address must be properly aligned.

What about vector instructions feeding a vector GEP feeding a gather/scatter?

That depends. Was the original program using that same value? If so, no problem. Otherwise I need a concrete example as I can't imagine one right away :)

Gathers/scatters are modeled as a vector of pointers using a vector GEP so if an address is poison it will be masked out, at least, initially. Unfortunately, some backends will turn this vector of pointers into a single base pointer + offsets. If the base pointer is poison, we have the same issue as with masked loads/stores. This is basically a vector variant of that problem. I would suggest dropping the flags also here, following the same logic. Does it make sense?

Sounds like the same problem, yes. But just dropping flags isn't enough, because similarly the base pointer could have been poison already.
Those backends are buggy. Fixing this requires either proving that the base pointer isn't poison in the first place, or deriving a base address from a non-masked-out lane (probably the easiest solution?).

Let me clarify... The GEP will feed a masked load/store. We won't load the data in that out-of-bounds address (masked out) but that address will be used as base for the masked load/store. It sounds like a case similar to the one exposing the bug. Since the address will be used, I understand we should drop the inbounds.

A masked load of a masked out lane is a NOP essentially, so the address can be poison as it's not actually used. You don't need to drop inbounds.
LangRef doesn't require the address of masked-out lanes to be aligned, unlike e.g. memcpy, where even with size=0 the address must be properly aligned.

Ok, let me update the patch so that we can discuss on actual examples. I have the impression that we both agree but might be looking at it from a different perspective.

Gathers/scatters are modeled as a vector of pointers using a vector GEP so if an address is poison it will be masked out, at least, initially. Unfortunately, some backends will turn this vector of pointers into a single base pointer + offsets. If the base pointer is poison, we have the same issue as with masked loads/stores. This is basically a vector variant of that problem. I would suggest dropping the flags also here, following the same logic. Does it make sense?

Sounds like the same problem, yes. But just dropping flags isn't enough, because similarly the base pointer could have been poison already.
Those backends are buggy. Fixing this requires either proving that the base pointer isn't poison in the first place, or deriving a base address from a non-masked-out lane (probably the easiest solution?).

Deriving a base address from a non-masked out lane makes sense but we don't always have that information at compile time. In any case, this patch is a good starting point to fix the backend problems. We can iterate on it later as we see fit.

dcaballe updated this revision to Diff 383687.Oct 31 2021, 3:01 PM
  • Added check to make sure we only drop poison-generating flags from instructions contributing to the address computation of masked loads/stores.
  • Removed logic to drop flags from widen GEPs (for gathers/scatters)
  • Removed logic to drop flags from all the widen instructions.
  • Reverted changes in impacted tests.
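To illustrate the intended effect of the change on the motivating example, here is a hand-written before/after sketch (not the patch's actual CHECK lines; the elided operands are left as in the summary):

```llvm
; Scalar loop (guarded by %i23, so the flags are valid):
;   %i27 = sub nuw nsw i64 %iv2, 1
;   %i29 = getelementptr inbounds [1 x [33 x float]], ..., i64 %i27
;
; After vectorization with linearized control flow, the computation also
; executes for %iv2 == 0, so the poison-generating flags are dropped:
;   %i27 = sub i64 %iv2, 1                         ; nuw/nsw dropped
;   %i29 = getelementptr [1 x [33 x float]], ...   ; inbounds dropped
```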
nlopes added a comment.Nov 3 2021, 6:52 AM

Looks good to me. Added a couple of suggestions to the code.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3138

isn't it inefficient to recompute this information for every operation? You get O(n^2) behavior (up to n calls of this function, each traversing up to n instructions)

3145

An address computation may depend on float operations, e.g. you may have a float-to-int conversion, a select based on a float comparison, etc.

3160

You don't need to consider users of loads (nor of stores). You can break here, if you want, once WidenMemRec != nullptr.

dcaballe updated this revision to Diff 384745.Nov 4 2021, 7:17 AM
dcaballe marked 2 inline comments as done.

Addressed feedback.
Changed the approach to improve the complexity of detecting the
poison-generating recipes. We now gather these recipes before
executing the VPlan, starting from the recipes generating a widen
load/store and traversing the backward slice from their address
operand. In this way, recipes are only visited once.

Please, let me know if there is any other comment.

Thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3138

I changed the approach so that we gather the target recipes before executing the VPlan.

3145

True, removed that check.

3160

It makes sense. Done. Updating doc accordingly.

nlopes accepted this revision.Nov 4 2021, 12:26 PM

looks correct to me. Left just one perf suggestion.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1195

this visited set can be hoisted outside the lambda. You should only traverse each instruction at most once, since you need to drop poison flags if the instruction contributes to any address (it doesn't matter which)

fhahn added a comment.Nov 7 2021, 1:16 AM

Thanks for the update! I think the description/title would need a minor update, as they still talk about 'integer' poison-generating flags.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
888

This is only used during codegen, right? In that case, it would probably be better to just add it to VPTransformState instead. Otherwise it just adds a little bit of additional coupling between ILV and codegen, making it harder to separate in the future.

1219

Is there a way to avoid querying information from legal here?

As we are interested in any operation contributing to an address computation of a masked memory operation, could we only seed the worklist with addresses of masked operations instead (both VPWidenMemoryInstructionRecipe & VPInterleaveRecipe have hasMask/getMask helpers)? Then we would not need to check legal here, right?

Checking Instr's location here might not be sufficient in any case, I think. There could be cases where Instr is in a block that does not need predication but feeds a memory instruction in a block that needs predication. In those cases, I think we should also drop the flags from Instr. In any case, it might be good to add such a test case.

1225

Do we in some cases also need to drop flags from recipes other than VPReplicateRecipe?

If we need to drop flags from recipes in unpredicated blocks (see comment above), then we could have cases where a vector operation (e.g. VPWidenRecipe) feeds the address computation (the first lane will get extracted). An example could be:

define void @drop_vector(float* noalias nocapture readonly %input,
                                 float* %output, i64* noalias %ints) local_unnamed_addr #0 {
entry:
  br label %loop.header

loop.header:
  %iv = phi i64 [ 0, %entry ], [ %iv.inc, %if.end ]
  %i23 = icmp eq i64 %iv, 0
  %gep = getelementptr inbounds i64, i64* %ints, i64 %iv
  %lv = load i64, i64* %gep
  %i27 = sub nuw nsw i64 %iv, 1
  %r = add i64 %lv, %i27
  store i64 %r, i64* %gep
  br i1 %i23, label %if.end, label %if.then

if.then:
  %i29 = getelementptr inbounds float, float* %input, i64 %i27
  %i30 = load float, float* %i29, align 4, !invariant.load !0
  br label %if.end

if.end:
  %i34 = phi float [ 0.000000e+00, %loop.header ], [ %i30, %if.then ]
  %i35 = getelementptr inbounds float, float* %output, i64 %iv
  store float %i34, float* %i35, align 4
  %iv.inc = add nuw nsw i64 %iv, 1
  %exitcond = icmp eq i64 %iv.inc, 4
  br i1 %exitcond, label %loop.exit, label %loop.header

loop.exit:
  ret void
}
attributes #0 = { noinline nounwind uwtable "target-features"="+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl" }

!0 = !{}

Thanks for the feedback, Florian! I added some comments. Working on some of the changes while waiting for some clarifications. Thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
888

Sure, will do!

1195

It makes sense. Will do.

1219

As we are interested in any operation contributing to an address computation of a masked memory operation, could we only seed the worklist with addresses of masked operations instead (both VPWidenMemoryInstructionRecipe & VPInterleaveRecipe have hasMask/getMask helpers)? Then we would not need to check legal here, right?

I had tried using getMask already. Unfortunately, there are cases where the loads are predicated but the mask has been dropped in favor of executing them speculatively unmasked. See https://reviews.llvm.org/D66688 and test load-deref-pred.ll. We should drop the flags from these cases as well and the only way I found is using Legal->blockNeedsPredication. I know it's not ideal but Legal is still widely used during codegen.

Checking Instrs location here might not be sufficient in any case I think. There could be cases where Instr is in a block that does not need predication, but feeds a memory instruction in a block that needs predication. In those case, I think we should also drop the flags from Instr. In any case, it might be good to add such a test case.

I think that makes sense. It would be a case similar to the one that exposed the bug, but with the sub instruction moved outside the condition. I'll give it a try!

1225

I assumed that the address computation would be uniform for those cases and left scalar, wouldn't it? I'll give it a try, but there has been some back and forth already w.r.t. dropping the flags from vector instructions. Based on previous feedback, I'm not sure if we should drop the flags from them when there are lanes that won't generate poison. I would need some clarification before proceeding with those changes (@nlopes).

dcaballe added inline comments.Nov 7 2021, 3:00 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1225

Looking closely at your example, the poison value generated by 'sub' in the first iteration of the loop is actually used in the scalar version of the code (it is stored). For that reason, my impression is that having the nuw flag in the input sub would be incorrect, isn't that the case?

dcaballe updated this revision to Diff 385340.Nov 7 2021, 6:32 AM
  • Added support for non-predicated poison-generating instruction cases.
  • Move MayGeneratePoisonRecipes to VPTransformState.
  • Update documentation.
dcaballe updated this revision to Diff 385341.Nov 7 2021, 6:43 AM
dcaballe marked 3 inline comments as done.

Moved visited set outside of lambda function

dcaballe added inline comments.Nov 7 2021, 6:46 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1219

getMask can't be used for the interleave recipe either. There are interleave groups that are formed out of non-predicated loads/stores. However, the resulting interleave group may still require a mask if the to-be-generated widen load reads more elements than those needed for the group. We should not drop poison flags in those cases.

dcaballe edited the summary of this revision. (Show Details)Nov 7 2021, 7:58 AM
nlopes added inline comments.Nov 7 2021, 12:00 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1225

I mentioned this issue before: what happens if a value was already poison in the first place?

I suggest you commit this patch first, as it addresses half of the problem, and then we can discuss what's the best way to fix the second part. Just dropping flags, even from all instructions within the function, isn't sufficient, as you may get a poison value as input (change the example above to have [ %arg, entry ] rather than [ 0, entry ] for %iv). Though you only need to patch the code for the cases where you cannot prove the code would execute in the scalar version. The right fix isn't trivial.

dcaballe updated this revision to Diff 385436.Nov 8 2021, 3:08 AM

Fixed failing test

Ok, I'll wait for Florian's approval to land this patch.

Thanks!
Diego

fhahn added inline comments.Nov 8 2021, 1:49 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1219

I had tried using getMask already. Unfortunately, there are cases where the loads are predicated but the mask has been dropped in favor of executing them speculatively unmasked. See https://reviews.llvm.org/D66688 and test load-deref-pred.ll. We should drop the flags from these cases as well and the only way I found is using Legal->blockNeedsPredication. I know it's not ideal but Legal is still widely used during codegen.

Hmm, that's unfortunate! There are still plenty of uses of Legal, but each new use we add makes things more difficult to transition. Let me take a look if there's another alternative, but if there's none it's not the end of the world for now.

1225

I tried reading the discussion again, but I'm not sure why it would matter for the test case whether the sub gets widened to a vector or not. In the test case, the issue should still be the same as in the motivating test case: the flags on the instruction can make (some) vector lanes poison, independent of whether the inputs were poison originally.

In the example, the sub nuw nsw gets widened, but similar to the other test cases, the first lane is used in the address computation of the masked memory operation. Effectively, I am not sure I see the difference between computing only the first lane (because we scalarized the sub) and computing all lanes while using the first lane only in a UB-on-poison op, with other non-UB-on-poison uses of the full vector (storing a vector with (some) poison lanes should not be UB).

If we only restrict this to VPReplicateRecipe, it seems like we can easily run into slight variations of the motivating test case where we still have the same issue as the patch is trying to fix.

Now, if we support dropping flags from non-BB-local instructions/recipes, there are cases where we may not need to drop flags, e.g. because we can prove that poison generated by the instruction causes UB unconditionally. In those cases, I think for now our priority should be correctness, as it's unlikely for poison-generating flags to make a notable difference for the vector loops during codegen.

1249

I think for consistency with the LLVM style used here variables should start with an upper case letter (https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly)

llvm/test/Transforms/LoopVectorize/X86/drop-poison-generating-flags.ll
82–83

Should nuw nsw be retained in the source here, and the same for inbounds? It looks like they got dropped.

dcaballe added inline comments.Nov 9 2021, 2:33 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1219

Sure! Note that blockNeedsPredication is used in quite a few places so all of them should be addressed in the same way.

1225

Semantics on poison-generating flags on vector instructions are not clear to me, even after the discussion. I added support for those cases initially, then removed it... It should be easy to add them again but I want to make sure we agree on that before introducing any changes. If those cases need further discussion I'll be happy to address them in a separate commit. I'm not too concerned about those cases since I don't think LLVM can replace the value of a specific lane with a poison value at this point.

Now if we support non-block dropping flags from non-BB-local instruction/recipes, there are cases where we may not need to drop flags, e.g. because we can prove that poison generated by the instruction cause UB unconditionally. In those cases, I think for now our priority should be correctness, as it’s unlikely for poison-generating flags to make a notable difference for the vector loops during codegen.

Not sure I fully understand this part, but the key point, as discussed previously, is the usage of the poison value. We may have an instruction that generates a poison value unconditionally. That should be OK as long as the potential poison value is not used. We should drop the flags if the potential poison value happens to be used after vectorization. That's exactly what the latest implementation is doing.

llvm/test/Transforms/LoopVectorize/X86/drop-poison-generating-flags.ll
82–83

This is the test you asked me to add, isn't it?

Checking Instr's location here might not be sufficient in any case, I think. There could be cases where Instr is in a block that does not need predication but feeds a memory instruction in a block that needs predication. In those cases, I think we should also drop the flags from Instr. In any case, it might be good to add such a test case.

sub and getelementptr are not predicated but feed a load that is predicated.

dcaballe updated this revision to Diff 385838.Nov 9 2021, 8:27 AM

Fixed naming convention issues

dcaballe updated this revision to Diff 385902.Nov 9 2021, 11:01 AM

Since it was already implemented in a previous version of the code, I restored the logic to drop flags
from Widen and WidenGEP recipes, following Florian's suggestion. These changes only impact Florian's
test case. I modified that test case to also have a vector GEP feeding a masked load, so that use case
is covered as well.

Let me know if you have any further comments.
Thanks

Any other comments? :)

Thanks for the update! I left a few small remaining comments. Basically LGTM, but it would still be good to hear @nlopes' thoughts on the recent responses with respect to widened instructions.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1213

I think it would be helpful to add a comment spelling out what we are looking for here.

1216

I'm not entirely sure why we need to check the parent region/replicator. It would be great if you could include the reasoning in the comment above, or remove it if it is not needed.

1231

nit: it might be more explicit to pass the plan as an argument instead of accessing it from State.

dcaballe updated this revision to Diff 387402.Nov 15 2021, 2:13 PM
dcaballe marked 2 inline comments as done.
  • Updated and added more comments.
  • Simplified redundant condition.

Thanks, Florian! I addressed the feedback.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1216

Good catch! Actually, most of these conditions are redundant or irrelevant in the new algorithm since we already know this recipe is contributing to the address computation of a VPWidenMemoryRecipe or VPInterleaveRecipe. That simplifies the condition a lot. Thanks!

1231

We are also using other fields from State here, like VF and MayGeneratePoisonRecipes, and this is the only use of Plan. I'm not sure it's worth it.

I talked to Nuno in private and he mentioned that I could go ahead and commit the changes and address any minor feedback in a separate commit since he is very busy right now.
I'll commit this on Monday if no other comments.

Thank you all for the feedback!
Diego

fhahn accepted this revision.Nov 21 2021, 1:58 PM

I talked to Nuno in private and he mentioned that I could go ahead and commit the changes and address any minor feedback in a separate commit since he is very busy right now.
I'll commit this on Monday if no other comments.

Thank you all for the feedback!
Diego

Sounds good to me! LGTM, thanks.

This patch already stretches my knowledge of the vectorizer code a bit, but the examples look good.
Thanks @fhahn for the test case. And apologies for the delay in getting back to this; I had a deadline Friday and was quite busy.

We should fuzz this thing. I'm still not super comfortable. I have a few theoretical concerns, but I don't know if they trigger in practice.

Thanks for all the feedback! I think this will cover a good part of the cases that could actually go wrong in practice. We can accommodate more cases as we see fit.