Page MenuHomePhabricator

[InstCombine] Disable optimizations of select instructions that causes propagation of poison values
Needs ReviewPublic

Authored by congzhe on Dec 10 2020, 1:24 PM.

Details

Summary

This is a work-in-progress.

Relevant discussions on Bugzilla: https://bugs.llvm.org/show_bug.cgi?id=48353

In brief (cited from Roman Lebedev),


define i1 @src(i1 %cmp.i, i1 %cmp41) {
%entry:
  %cmp4 = select i1 %cmp.i, i1 1, i1 %cmp41
  ret i1 %cmp4
}

->

define i1 @tgt(i1 %cmp.i, i1 %cmp41) {
%entry:
  %cmp4 = or i1 %cmp.i, %cmp41
  ret i1 %cmp4
}

Transformation doesn't verify!
ERROR: Target is more poisonous than source

Example:
i1 %cmp.i = #x1 (1)
i1 %cmp41 = poison

Source:
i1 %cmp4 = #x1 (1)

Target:
i1 %cmp4 = poison
Source value: #x1 (1)

Target value: poison


Therefore, this patches disabled all such optimizations of select instructions that result in propagation of poison values.

In addition, minor changes are also made to ensure there is no functional problem after disabling the abovementioned optimizations.

Performance measurement should be taken to measure potential degradations. Currently in progress. Many regression tests should be updated as well.

Diff Detail

Event Timeline

congzhe created this revision.Dec 10 2020, 1:24 PM
congzhe requested review of this revision.Dec 10 2020, 1:24 PM
congzhe edited the summary of this revision. (Show Details)Dec 10 2020, 1:28 PM
congzhe edited the summary of this revision. (Show Details)
aqjune added inline comments.Dec 11 2020, 12:30 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2643

Is this code still reachable when TrueVal/FalseVal are vectors of i1?

congzhe updated this revision to Diff 311362.Dec 11 2020, 8:56 PM

Updated to address comments.

congzhe added inline comments.Dec 11 2020, 8:58 PM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2643

Thanks for the comment! Now updated to handle vectors of i1 as well.

Performance data? No regressions for benchmarks?

nikic requested changes to this revision.Dec 12 2020, 1:17 AM
nikic added a subscriber: nikic.

This will require other parts of the compiler (especially anything dealing with and/or'd branch conditions, like SCEV, LVI, ValueTracking etc) to understand this new form of and/or first. I don't think most optimizations care about whether and/or is poison-blocking or not, but they need to recognize the select form if we don't canonicalize.

This revision now requires changes to proceed.Dec 12 2020, 1:17 AM

You will also need to update several regression tests (make check or ninja check) to show new canonical form.
I'm still not sure which path (this or insert freezes) is easier/better - if we have perf data, maybe it will tell us how to go?

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2640

This seems more complicated than necessary. Can't we just transfer the check from the block above to here:

if (SelType->isIntOrIntVectorTy() && !SelType->isIntOrIntVectorTy(1) && ...)

We can do that as an NFC preliminary patch regardless of whether we decide to proceed with this or not.

Performance data? No regressions for benchmarks?

Thanks for the comment! As mentioned in the summary, the performance testing is currently in progress.

This will require other parts of the compiler (especially anything dealing with and/or'd branch conditions, like SCEV, LVI, ValueTracking etc) to understand this new form of and/or first. I don't think most optimizations care about whether and/or is poison-blocking or not, but they need to recognize the select form if we don't canonicalize.

Thank you, I do agree with you that some infrastructures like SCEV might be impacted. Could you clarify it a bit? I am working on unit tests and perf degradation. Should the cause of any of them be those infrastructures, I will post fixes. Would that address your concern? Or something more than that is needed?"

You will also need to update several regression tests (make check or ninja check) to show new canonical form.
I'm still not sure which path (this or insert freezes) is easier/better - if we have perf data, maybe it will tell us how to go?

Thanks Sanjay, by "perf data" I'm wondering which benchmarks does the community mostly care about? LLVM test-suite?

As you said there's two ways to address this bug -- one is this patch, the other is inserting freezes. With this patch I do see degradations on some internal benchmarks and I'm working on the fix. Do you think it is the right track? I just would like to make sure before going too far.

The work I do is on AArch64. It may or may not impact other architectures (X86, PowerPC, etc). I'm wondering what is the typical way in the community to address performance degradations since addressing degradations for all possible architectures seems significant amount of work hence not likely?

Current failing regression tests:

Failed Tests (10):
  Clang :: CodeGenOpenCL/amdgpu-nullptr.cl
  LLVM :: Transforms/InstCombine/icmp.ll
  LLVM :: Transforms/InstCombine/minmax-fold.ll
  LLVM :: Transforms/InstCombine/select-bitext.ll
  LLVM :: Transforms/InstCombine/select-cmp-br.ll
  LLVM :: Transforms/InstCombine/select.ll
  LLVM :: Transforms/InstCombine/select_meta.ll
  LLVM :: Transforms/PGOProfile/chr.ll
  LLVM :: Transforms/PhaseOrdering/X86/vector-reductions.ll
  LLVM :: Transforms/PhaseOrdering/unsigned-multiply-overflow-check.ll
congzhe updated this revision to Diff 311691.Dec 14 2020, 1:36 PM
congzhe added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2640

Thanks for the comment, now updated accordingly and will post this piece of code as another separate patch.

This will require other parts of the compiler (especially anything dealing with and/or'd branch conditions, like SCEV, LVI, ValueTracking etc) to understand this new form of and/or first. I don't think most optimizations care about whether and/or is poison-blocking or not, but they need to recognize the select form if we don't canonicalize.

Thank you, I do agree with you that some infrastructures like SCEV might be impacted. Could you clarify it a bit? I am working on unit tests and perf degradation. Should the cause of any of them be those infrastructures, I will post fixes. Would that address your concern? Or something more than that is needed?"

Thanks Sanjay, by "perf data" I'm wondering which benchmarks does the community mostly care about? LLVM test-suite?

We would get a different answer for each person/company, but test-suite is solid common ground. I think SPEC is also widely used by a large part of the community.

As you said there's two ways to address this bug -- one is this patch, the other is inserting freezes. With this patch I do see degradations on some internal benchmarks and I'm working on the fix. Do you think it is the right track? I just would like to make sure before going too far.

The work I do is on AArch64. It may or may not impact other architectures (X86, PowerPC, etc). I'm wondering what is the typical way in the community to address performance degradations since addressing degradations for all possible architectures seems significant amount of work hence not likely?

In general, we try to address known regressions in advance. If you see a problem on AArch64, it will probably not be too different for the other CPU targets.
It is acknowledged that it is impossible to know how a change like this will affect everything, so as long as we have a plan to deal with problems, it's fine to proceed.
My suggestion is to compare the regressions between this patch vs. adding freeze on test-suite. Is there a clear winner (number and average size of regressions)?
Thank you for pushing this forward!

This will require other parts of the compiler (especially anything dealing with and/or'd branch conditions, like SCEV, LVI, ValueTracking etc) to understand this new form of and/or first. I don't think most optimizations care about whether and/or is poison-blocking or not, but they need to recognize the select form if we don't canonicalize.

Thank you, I do agree with you that some infrastructures like SCEV might be impacted. Could you clarify it a bit? I am working on unit tests and perf degradation. Should the cause of any of them be those infrastructures, I will post fixes. Would that address your concern? Or something more than that is needed?"

Thanks Sanjay, by "perf data" I'm wondering which benchmarks does the community mostly care about? LLVM test-suite?

We would get a different answer for each person/company, but test-suite is solid common ground. I think SPEC is also widely used by a large part of the community.

As you said there's two ways to address this bug -- one is this patch, the other is inserting freezes. With this patch I do see degradations on some internal benchmarks and I'm working on the fix. Do you think it is the right track? I just would like to make sure before going too far.

The work I do is on AArch64. It may or may not impact other architectures (X86, PowerPC, etc). I'm wondering what is the typical way in the community to address performance degradations since addressing degradations for all possible architectures seems significant amount of work hence not likely?

In general, we try to address known regressions in advance. If you see a problem on AArch64, it will probably not be too different for the other CPU targets.
It is acknowledged that it is impossible to know how a change like this will affect everything, so as long as we have a plan to deal with problems, it's fine to proceed.
My suggestion is to compare the regressions between this patch vs. adding freeze on test-suite. Is there a clear winner (number and average size of regressions)?
Thank you for pushing this forward!

On test suite it seems we did not have a clear winner.

Out of ~3600 microbenchmarks I choose the ones with running time greater than 4us, that gives me ~250 benchmarks. Regarding the overall geometric mean difference to llvm trunk code, both the freeze solution and this patch have less than 0.7% difference to trunk. We have 9 microbenchmarks that degrades over 2% for the freeze solution where the geomean for those 9 degradations is 3.7%; for this patch we have 12 microbenchmarks that degrades over 2% for the freeze solution where the geomean for those 12 degradations is 4.3%.

If I choose the microbenchmarks with running time greater than 0.5us, that gives me ~400 microbenchmarks where I have similar observations.

However, on internal benchmarks I see much less degradation from the freeze solution compared to this patch.

Does this sound like a clue that we should go ahead with the freeze solution? If so I'll post the freeze solution and see how we can go from there.

However, on internal benchmarks I see much less degradation from the freeze solution compared to this patch.

Does this sound like a clue that we should go ahead with the freeze solution? If so I'll post the freeze solution and see how we can go from there.

Ok, the freeze solution might be easier to implement. It seems more straightforward to teach passes to peek through a freeze vs. learn new patterns? I defer to @aqjune and @nikic and other reviewers for opinions/experience. I haven't done that work myself.

Using freeze loses information (if some of the inputs was poison). Plus It requires an extra op.
If we canonicalize around select there's no loss of information and it's just 1 instruction.

The disadvantage is that then we have 2 ways or doing boolean ANDs/ORs. Though most analyses can be patched easily, as most LLVM analyses' results are of the form "x has property foo unless it's poison". So for those analyses using and/or or select is the same (as the only difference between these is propagation of poison).
Other analyses/optimization can learn about select as needed.

Using freeze loses information (if some of the inputs was poison). Plus It requires an extra op.
If we canonicalize around select there's no loss of information and it's just 1 instruction.

The disadvantage is that then we have 2 ways or doing boolean ANDs/ORs. Though most analyses can be patched easily, as most LLVM analyses' results are of the form "x has property foo unless it's poison". So for those analyses using and/or or select is the same (as the only difference between these is propagation of poison).
Other analyses/optimization can learn about select as needed.

Thank you for raising up the good point! I understand that we lose information by preventing poison values from propagation using freeze. But I'm unclear what would be the side effect or problem with that? I'd appreciate it if you could clarify a bit, thanks!

Using freeze loses information (if some of the inputs was poison). Plus It requires an extra op.
If we canonicalize around select there's no loss of information and it's just 1 instruction.

The disadvantage is that then we have 2 ways or doing boolean ANDs/ORs. Though most analyses can be patched easily, as most LLVM analyses' results are of the form "x has property foo unless it's poison". So for those analyses using and/or or select is the same (as the only difference between these is propagation of poison).
Other analyses/optimization can learn about select as needed.

Thank you for raising up the good point! I understand that we lose information by preventing poison values from propagation using freeze. But I'm unclear what would be the side effect or problem with that? I'd appreciate it if you could clarify a bit, thanks!

In practice, probably not a lot. But it may have implications for loop optimization, like:

for (i=0; some_bool && i < limit; ++i) {
...
}

If you remove the poison from the i+1 < limit bit it may make the work of SCEV harder (or impossible; didn't think the example through carefully).
Another example is hoisting of sext(i32)->i64 out of loops. This is important for perf and it relies on poison to prove no wrapping and we can't destroy that. Probably most loop conditions don't have ANDs/ORs, hence why I say the impact in practice is small.

for (i=0; some_bool && i < limit; ++i) {
...
}

Assuming that all relevant optimizations are fully implemented in both scenarios (and/or with freeze vs. select), I think their performance diff should be zero because C/C++ constrains variables to have well-defined values.
Currently, this isn't enforced by clang. We already have !noundef, and simply attaching it to loads will do the trick.
Should we move toward the direction and suggest it to llvm-dev? The diff in clang src won't be big, but people might worry about their program being more undefined (which are written in that way due to a practical reason).

Using freeze loses information (if some of the inputs was poison). Plus It requires an extra op.
If we canonicalize around select there's no loss of information and it's just 1 instruction.

The disadvantage is that then we have 2 ways or doing boolean ANDs/ORs. Though most analyses can be patched easily, as most LLVM analyses' results are of the form "x has property foo unless it's poison". So for those analyses using and/or or select is the same (as the only difference between these is propagation of poison).
Other analyses/optimization can learn about select as needed.

Thank you for raising up the good point! I understand that we lose information by preventing poison values from propagation using freeze. But I'm unclear what would be the side effect or problem with that? I'd appreciate it if you could clarify a bit, thanks!

In practice, probably not a lot. But it may have implications for loop optimization, like:

for (i=0; some_bool && i < limit; ++i) {
...
}

If you remove the poison from the i+1 < limit bit it may make the work of SCEV harder (or impossible; didn't think the example through carefully).

Can I just make sure my understanding is correct -- so when we check the SCEV of some_bool && i < limit; we do recursion backwards on this select instruction (after this patch) or on an AND instruction (before this patch). If we choose the freeze approach, we'll do recursion on the AND instruction and eventually hit a freeze instruction which SCEV does not know how to handle, hence SCEV will just return CouldNotCompute?

In practice, probably not a lot. But it may have implications for loop optimization, like:

for (i=0; some_bool && i < limit; ++i) {
...
}

If you remove the poison from the i+1 < limit bit it may make the work of SCEV harder (or impossible; didn't think the example through carefully).

Can I just make sure my understanding is correct -- so when we check the SCEV of some_bool && i < limit; we do recursion backwards on this select instruction (after this patch) or on an AND instruction (before this patch). If we choose the freeze approach, we'll do recursion on the AND instruction and eventually hit a freeze instruction which SCEV does not know how to handle, hence SCEV will just return CouldNotCompute?

It's more than just teaching SCEV how to handle freeze. A freeze instruction destroys UB and therefore not even the most precise SCEV would be able to recover the information that is lost. select retains more information/UB.
So if SCEV requires poison from the add nsw to e.g. determine the number of iterations, we would miss out with the freeze implementation. It's a hypothetical loss that I can't quantify without testing. My argument for using select is just that it makes the IR smaller and therefore preferable. Since it has the nice extra benefit of keeping more UB, then great, but not the main argument.

Assuming that all relevant optimizations are fully implemented in both scenarios (and/or with freeze vs. select), I think their performance diff should be zero because C/C++ constrains variables to have well-defined values.
Currently, this isn't enforced by clang. We already have !noundef, and simply attaching it to loads will do the trick.

Thinking about this again, this isn't true. Since !noundef can be stripped when the load is hoisted, there still is a possible UB loss.
So, in terms of preserving UB, using select will be the better approach.

But, there are so many transformations to consider if select is used; from InstCombine optimizations that involve And/Or to analysis, simplifycfg, etc. It seems to fix all of them is a hard job.

Assuming that all relevant optimizations are fully implemented in both scenarios (and/or with freeze vs. select), I think their performance diff should be zero because C/C++ constrains variables to have well-defined values.
Currently, this isn't enforced by clang. We already have !noundef, and simply attaching it to loads will do the trick.

Thinking about this again, this isn't true. Since !noundef can be stripped when the load is hoisted, there still is a possible UB loss.
So, in terms of preserving UB, using select will be the better approach.

But, there are so many transformations to consider if select is used; from InstCombine optimizations that involve And/Or to analysis, simplifycfg, etc. It seems to fix all of them is a hard job.

I think we'll just have to bite the bullet and do that. In a systematic way, before this lands. I think that the number of places that need to be changed is actually rather limited as we're only interested in and/or on booleans. We don't have that many analyses dealing with conditions.

Here's a sample patch for LVI: https://gist.github.com/nikic/fac8124a95901b7ff9ac44513f554578 With this PatternMatch boilerplate in place, the actual LVI change is trivial. The main work is adding relevant tests...

Worth noting is that some of the code deleted in this patch should be retained as a canonicalization. E.g. a ? a : b should become a ? true : b, a ? b : true should become !a ? true : b, etc. That way the remaining code only has to deal with canonical select patterns.

It may also be worthwhile to convert the special cases of icmp (a, X) ? (icmp a, Y) : false to icmp (a, X) & (icmp a, Y), though I'm not sure if that is even useful if we handle the select pattern reasonable everywhere else.

I think we'll just have to bite the bullet and do that. In a systematic way, before this lands. I think that the number of places that need to be changed is actually rather limited as we're only interested in and/or on booleans. We don't have that many analyses dealing with conditions.

+1

Worth noting is that some of the code deleted in this patch should be retained as a canonicalization. E.g. a ? a : b should become a ? true : b, a ? b : true should become !a ? true : b, etc. That way the remaining code only has to deal with canonical select patterns.

Yes, this makes sense to me.

It may also be worthwhile to convert the special cases of icmp (a, X) ? (icmp a, Y) : false to icmp (a, X) & (icmp a, Y), though I'm not sure if that is even useful if we handle the select pattern reasonable everywhere else.

I think this is helpful because it helps the expression syntactically show more information. Poison propagation is more aggressive in the latter case.
If we want to support this transformation, I think having something like impliesPoison in ValueTracking and calling it when doing this conversion will be helpful. I locally have an old patch that does this (maybe somewhere in Phabricator too).
What do you think about this workflow:
(1) Add the LogicalOr/And pattern match, make a few important analysis (LVI, transformations in ValueTracking) as well transformations like GVN support select pattern.
(2) If necessary, add an analysis function that can be used for conditionally allowing select->and/or.
(3) See how the performance goes..!

Relevant patches in the past that partially allows select -> and/or: https://reviews.llvm.org/D77868 , https://reviews.llvm.org/D78152

I wanted to make ScalarEvolution recognize select pattern, and became a bit uncertain about its validity.

Take this example: https://alive2.llvm.org/ce/z/NsP9ue
SCEV's computeExitLimit can return min(n, m) as ExactNotTaken value, so I put llvm.assume to show its validity.
But it fails because the exit limit becomes poison if n is zero and m is poison. This will make e.g. replacing the last value of i with min(n, m) invalid.
If and is used instead, this becomes okay: https://alive2.llvm.org/ce/z/K9rbJk

If there is a guard about n and m at the loop header it would be safe I think. Is this what SCEV assumes? Then, how should the input look like?

nikic added a comment.Mon, Dec 28, 3:41 AM

I wanted to make ScalarEvolution recognize select pattern, and became a bit uncertain about its validity.

Take this example: https://alive2.llvm.org/ce/z/NsP9ue
SCEV's computeExitLimit can return min(n, m) as ExactNotTaken value, so I put llvm.assume to show its validity.
But it fails because the exit limit becomes poison if n is zero and m is poison. This will make e.g. replacing the last value of i with min(n, m) invalid.
If and is used instead, this becomes okay: https://alive2.llvm.org/ce/z/K9rbJk

If there is a guard about n and m at the loop header it would be safe I think. Is this what SCEV assumes? Then, how should the input look like?

I think you are right about this. SCEV doesn't really have the facilities to represent this. I think the only thing we can do at the SCEV level is to at least determine that the first condition implies an upper bound, even if we can't use the second one. Guess this is more motivation to try fairly hard to convert select -> and/or for the cases where we can do so.

I wanted to make ScalarEvolution recognize select pattern, and became a bit uncertain about its validity.

Take this example: https://alive2.llvm.org/ce/z/NsP9ue
SCEV's computeExitLimit can return min(n, m) as ExactNotTaken value, so I put llvm.assume to show its validity.
But it fails because the exit limit becomes poison if n is zero and m is poison. This will make e.g. replacing the last value of i with min(n, m) invalid.
If and is used instead, this becomes okay: https://alive2.llvm.org/ce/z/K9rbJk

If there is a guard about n and m at the loop header it would be safe I think. Is this what SCEV assumes? Then, how should the input look like?

I think you are right about this. SCEV doesn't really have the facilities to represent this. I think the only thing we can do at the SCEV level is to at least determine that the first condition implies an upper bound, even if we can't use the second one. Guess this is more motivation to try fairly hard to convert select -> and/or for the cases where we can do so.

Okay.. let me make a patch for this.

Okay, I think the remaining optimizations (other than opened patches) are:

  • LoopUnswitch.cpp: I can address this
  • PredicateInfo.cpp: I'm not sure whether a local change will be enough, because the relevant classes in PredicateInfo.h are exposed to other optimizations.
  • SimpleLoopUnswitch.cpp, LoopPredication.cpp, InductiveRangeCheckElimination.cpp: have no idea

After LoopUnswitch and PredicateInfo is updated & the select -> and/or is conditionally allowed, we can try performance evaluation again and see whether there still exists regression. What do you think? @nikic @congzhe

Okay, I think the remaining optimizations (other than opened patches) are:

  • LoopUnswitch.cpp: I can address this
  • PredicateInfo.cpp: I'm not sure whether a local change will be enough, because the relevant classes in PredicateInfo.h are exposed to other optimizations.
  • SimpleLoopUnswitch.cpp, LoopPredication.cpp, InductiveRangeCheckElimination.cpp: have no idea

After LoopUnswitch and PredicateInfo is updated & the select -> and/or is conditionally allowed, we can try performance evaluation again and see whether there still exists regression. What do you think? @nikic @congzhe

Thank you all for putting all the efforts addressing the canonicalization of select form!
Sure I will apply all these patches, check the benchmark performance and let you know the result. Thanks again!

aqjune added a comment.Sun, Jan 3, 7:34 AM

For loop unswitch, I'll work on it after D93764 is done

nikic added a comment.Fri, Jan 8, 12:51 AM

@aqjune Do you have any automated way of duplicating the InstCombine tests with logical and/or, similar to what you did for insertelement? Or does this have to be done by hand?

@aqjune Do you have any automated way of duplicating the InstCombine tests with logical and/or, similar to what you did for insertelement? Or does this have to be done by hand?

Technically, it wouldn't be hard.
But I'm slightly concerned that the situation is slightly different from the vector's poison vector placeholder patches, because this is about disabling optimizations.
A mail should be sent to llvm-dev to notify this issue and people should be aware that performance degradation may happen.
Well, the root problem is that I don't have enough time for this for a while until the end of Jan. :( I don't want someone else to take over the responsibility either; this would be a lot of work.

I think the modest solution is to simply revert the poison constant folding patch (D92270), except a few cases like division. select -> and/or won't interact with poison from div undef because executing div undef already raises UB in src.
After the select -> and/or is removed, the patch should definitely be applied again,

nikic added a comment.Tue, Jan 19, 2:32 PM

Current diff for flipping the flag: https://gist.github.com/nikic/f65b36adb70c93f5da9bfe3422fd8904 There's still a number of cases that can be handled.

I have a patch for supporting logical and/or in PredicateInfo, but waiting on D94447 to land first.

I also noticed that SimplifyCFG is creating and/or when merging branch conditions in https://github.com/llvm/llvm-project/blob/ecf696641e6ce4b22e8c8ea3c7476b9c1f0f200b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L2959. I think it would make sense to switch that code over to use the select form first. That will make the transform correct in SimplifyCFG, but still let it be folded by InstCombine. As other passes have already been adjusted to recognize logical and/or, this should be low impact.

Current diff for flipping the flag: https://gist.github.com/nikic/f65b36adb70c93f5da9bfe3422fd8904 There's still a number of cases that can be handled.

I think I can handle or_andn_cmp_1_logical and its family, llvm.umul.with.overflow.i64 related transformations, and PR41069_logical.

I also noticed that SimplifyCFG is creating and/or when merging branch conditions in https://github.com/llvm/llvm-project/blob/ecf696641e6ce4b22e8c8ea3c7476b9c1f0f200b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L2959. I think it would make sense to switch that code over to use the select form first. That will make the transform correct in SimplifyCFG, but still let it be folded by InstCombine. As other passes have already been adjusted to recognize logical and/or, this should be low impact.

+1, this very makes sense to me; InstCombine happens immediately after SimplifyCFG in many places.
I believe non-InstCombine passes coming after SimplifyCFG are already familiar with unoptimized select instructions as well because SimplifyCFG is already generating such ops (see test6f)
I'll make a patch for this.