Page MenuHomePhabricator

[InstSimplify] LCSSA PHIs should not be simplified away
AbandonedPublic

Authored by alex-t on Jun 18 2019, 6:11 AM.

Details

Summary

We rely on LCSSA in case when the uniform value defined inside the loop with divergent exit is used outside the loop.
In some cases EarlyCSE eliminates LCSSA PHIs because of the algorithm used in SimplifyPHINode function.
It always assumes more then one incoming values and mistakenly removes LCSSA PHIs.
This causes generation of incorrect code in AMDGPU backend.

Diff Detail

Event Timeline

alex-t created this revision.Jun 18 2019, 6:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 18 2019, 6:11 AM
Herald added subscribers: hiraditya, tpr. · View Herald Transcript

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

lebedev.ri retitled this revision from LCSSA PHIs should not be simplified away to [InstSimplify] LCSSA PHIs should not be simplified away.Jun 18 2019, 6:54 AM

(Also missing tests)

alex-t updated this revision to Diff 205345.Jun 18 2019, 7:41 AM

MIR test added

alex-t added a comment.EditedJun 18 2019, 7:52 AM

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

Manually written IR w/o lcssa phis will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

The other condition should probably be TTI.hasBranchDivergence().

alex-t added a comment.EditedJun 18 2019, 10:19 AM

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

The other condition should probably be TTI.hasBranchDivergence().

Sure! Good point. Thanks.
Although adding target specific information to such a general part requires a lot of changes to pass TTI from the Pass level to all the calls.

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

In fact I still insist that this is a bug in -instsimplify pass.
The algorithm is written in such a way that it always expect more then one input in PHI node.
That means that the person who's written it had no intention to really remove the LCSSA PHIs but was just unaware of their existence!
It is not correct to remove LCSSA in similar manner as PHI nodes with equal inputs are removed.
So, any other backend should not suffer from this change. It is not about the convenience for AMDGPU backend but about correctness.
Could you provide an example where LCSSA PHIs are added and then intentionally removed?

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

In fact I still insist that this is a bug in -instsimplify pass.
The algorithm is written in such a way that it always expect more then one input in PHI node.
That means that the person who's written it had no intention to really remove the LCSSA PHIs but was just unaware of their existence!
It is not correct to remove LCSSA in similar manner as PHI nodes with equal inputs are removed.
So, any other backend should not suffer from this change. It is not about the convenience for AMDGPU backend but about correctness.
Could you provide an example where LCSSA PHIs are added and then intentionally removed?

Can you please specify why this transform is invalid from LLVM IR point of view? https://godbolt.org/z/D8gKNc
In endloop BB, which has a single predecessor BB - loop, %counter.lcssa value can only be %counter value.

arsenm added a subscriber: arsenm.Jun 18 2019, 11:16 AM

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

In fact I still insist that this is a bug in -instsimplify pass.
The algorithm is written in such a way that it always expect more then one input in PHI node.
That means that the person who's written it had no intention to really remove the LCSSA PHIs but was just unaware of their existence!
It is not correct to remove LCSSA in similar manner as PHI nodes with equal inputs are removed.
So, any other backend should not suffer from this change. It is not about the convenience for AMDGPU backend but about correctness.
Could you provide an example where LCSSA PHIs are added and then intentionally removed?

Can you please specify why this transform is invalid from LLVM IR point of view? https://godbolt.org/z/D8gKNc
In endloop BB, which has a single predecessor BB - loop, %counter.lcssa value can only be %counter value.

Nothing is invalid from the IR point of view. This is all a kludge to get divergence information into SelectionDAG. There needs to be an IR instruction at the use point for the DAG to query the divergence analysis

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

In fact I still insist that this is a bug in -instsimplify pass.
The algorithm is written in such a way that it always expect more then one input in PHI node.
That means that the person who's written it had no intention to really remove the LCSSA PHIs but was just unaware of their existence!
It is not correct to remove LCSSA in similar manner as PHI nodes with equal inputs are removed.
So, any other backend should not suffer from this change. It is not about the convenience for AMDGPU backend but about correctness.
Could you provide an example where LCSSA PHIs are added and then intentionally removed?

Can you please specify why this transform is invalid from LLVM IR point of view? https://godbolt.org/z/D8gKNc
In endloop BB, which has a single predecessor BB - loop, %counter.lcssa value can only be %counter value.

Okay. You are right for this concrete example. I also agree that LCSSA need to be simplified in trivial cases.
The problem is that this algorithm eliminates them in all cases. It looks like it was designed to detect PHIs with equal inputs and eliminates LCSSA unintentionally.
Anyway, I've got what is your objection about.

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

In fact I still insist that this is a bug in -instsimplify pass.
The algorithm is written in such a way that it always expect more then one input in PHI node.
That means that the person who's written it had no intention to really remove the LCSSA PHIs but was just unaware of their existence!
It is not correct to remove LCSSA in similar manner as PHI nodes with equal inputs are removed.
So, any other backend should not suffer from this change. It is not about the convenience for AMDGPU backend but about correctness.
Could you provide an example where LCSSA PHIs are added and then intentionally removed?

Can you please specify why this transform is invalid from LLVM IR point of view? https://godbolt.org/z/D8gKNc
In endloop BB, which has a single predecessor BB - loop, %counter.lcssa value can only be %counter value.

Nothing is invalid from the IR point of view. This is all a kludge to get divergence information into SelectionDAG. There needs to be an IR instruction at the use point for the DAG to query the divergence analysis

That means I have to change all the function declarations on the way to be able to pass TTI to check if it is the "divergent" target (

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

In fact I still insist that this is a bug in -instsimplify pass.
The algorithm is written in such a way that it always expect more then one input in PHI node.
That means that the person who's written it had no intention to really remove the LCSSA PHIs but was just unaware of their existence!
It is not correct to remove LCSSA in similar manner as PHI nodes with equal inputs are removed.
So, any other backend should not suffer from this change. It is not about the convenience for AMDGPU backend but about correctness.
Could you provide an example where LCSSA PHIs are added and then intentionally removed?

Can you please specify why this transform is invalid from LLVM IR point of view? https://godbolt.org/z/D8gKNc
In endloop BB, which has a single predecessor BB - loop, %counter.lcssa value can only be %counter value.

Nothing is invalid from the IR point of view. This is all a kludge to get divergence information into SelectionDAG. There needs to be an IR instruction at the use point for the DAG to query the divergence analysis

That means I have to change all the function declarations on the way to be able to pass TTI to check if it is the "divergent" target (

I don't think InstSimplify should depend on target information

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

In fact I still insist that this is a bug in -instsimplify pass.
The algorithm is written in such a way that it always expect more then one input in PHI node.
That means that the person who's written it had no intention to really remove the LCSSA PHIs but was just unaware of their existence!
It is not correct to remove LCSSA in similar manner as PHI nodes with equal inputs are removed.
So, any other backend should not suffer from this change. It is not about the convenience for AMDGPU backend but about correctness.
Could you provide an example where LCSSA PHIs are added and then intentionally removed?

Can you please specify why this transform is invalid from LLVM IR point of view? https://godbolt.org/z/D8gKNc
In endloop BB, which has a single predecessor BB - loop, %counter.lcssa value can only be %counter value.

Nothing is invalid from the IR point of view. This is all a kludge to get divergence information into SelectionDAG. There needs to be an IR instruction at the use point for the DAG to query the divergence analysis

That means I have to change all the function declarations on the way to be able to pass TTI to check if it is the "divergent" target (

I don't think InstSimplify should depend on target information

Can't you just schedule an LCSSAPass somewhere late in pipeline, before SelectionDAG and after early-cse?

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

In fact I still insist that this is a bug in -instsimplify pass.
The algorithm is written in such a way that it always expect more then one input in PHI node.
That means that the person who's written it had no intention to really remove the LCSSA PHIs but was just unaware of their existence!
It is not correct to remove LCSSA in similar manner as PHI nodes with equal inputs are removed.
So, any other backend should not suffer from this change. It is not about the convenience for AMDGPU backend but about correctness.
Could you provide an example where LCSSA PHIs are added and then intentionally removed?

Can you please specify why this transform is invalid from LLVM IR point of view? https://godbolt.org/z/D8gKNc
In endloop BB, which has a single predecessor BB - loop, %counter.lcssa value can only be %counter value.

Nothing is invalid from the IR point of view. This is all a kludge to get divergence information into SelectionDAG. There needs to be an IR instruction at the use point for the DAG to query the divergence analysis

That means I have to change all the function declarations on the way to be able to pass TTI to check if it is the "divergent" target (

I don't think InstSimplify should depend on target information

That's exactly why I don't like this idea as well.

This causes generation of incorrect code in AMDGPU backend.

This sounds like some other check is missing elsewhere?
What happens if you feed it such an ir as-if after this transform, but manually written?
("that will result in broken asm/crashes" is hopefully not the answer)

That being said, why is LCSSAPass not sufficient?
It's already supposed to undo transforms like this.

It will result in syntactically correct asm and no crashes. In runtime we'll get incorrect result though :)
Adding LCSSA pass again later on is difficult in the sense of the pass dependencies.
So, it's better to fix the explicit bug in SimplifyPHI....

Aha, so it's not -instsimplify pass itself, but how it's used during transition into backend.

  1. You certainly don't want to make this blacklist unconditional, it should still run when the -instsimplify pass itself is run. (+instsimplify test)
  2. How does this affect other targets (backends)? Does this need some TLI hook?

In fact I still insist that this is a bug in -instsimplify pass.
The algorithm is written in such a way that it always expect more then one input in PHI node.
That means that the person who's written it had no intention to really remove the LCSSA PHIs but was just unaware of their existence!
It is not correct to remove LCSSA in similar manner as PHI nodes with equal inputs are removed.
So, any other backend should not suffer from this change. It is not about the convenience for AMDGPU backend but about correctness.
Could you provide an example where LCSSA PHIs are added and then intentionally removed?

Can you please specify why this transform is invalid from LLVM IR point of view? https://godbolt.org/z/D8gKNc
In endloop BB, which has a single predecessor BB - loop, %counter.lcssa value can only be %counter value.

Nothing is invalid from the IR point of view. This is all a kludge to get divergence information into SelectionDAG. There needs to be an IR instruction at the use point for the DAG to query the divergence analysis

That means I have to change all the function declarations on the way to be able to pass TTI to check if it is the "divergent" target (

I don't think InstSimplify should depend on target information

Can't you just schedule an LCSSAPass somewhere late in pipeline, before SelectionDAG and after early-cse?

We tried but got a bunch of troubles: https://reviews.llvm.org/D60834

Since this problem is being encountered, i'm guessing that even without this workaround,
the same problem can still be encountered if the original IR is not in LCSSA form to begin with,
i.e. i think the D60834 sounds like the properer fix (ignoring the roadblocks on the way there.)

Since this problem is being encountered, i'm guessing that even without this workaround,
the same problem can still be encountered if the original IR is not in LCSSA form to begin with,
i.e. i think the D60834 sounds like the properer fix (ignoring the roadblocks on the way there.)

Agree - InstSimplify is a very early (but frequently used) target-independent (no TTI/TLI) canonicalization pass/analysis. I don't see why it should be crippled as proposed in this patch.
The problem is that target-independent IR canonicalization passes like EarlyCSE are running late in the IR codegen pipeline?

Also, this code change causes lots of existing middle-end test failures, so it doesn't appear to be some harmless/degenerate pattern:

Failing Tests (43):

LLVM :: Transforms/CodeGenPrepare/X86/computedgoto.ll
LLVM :: Transforms/Coroutines/ArgAddr.ll
LLVM :: Transforms/CorrelatedValuePropagation/select.ll
LLVM :: Transforms/GlobalDCE/complex-constantexpr.ll
LLVM :: Transforms/IndVarSimplify/rewrite-loop-exit-value.ll
LLVM :: Transforms/Inline/AArch64/phi.ll
LLVM :: Transforms/Inline/bfi-update.ll
LLVM :: Transforms/Inline/inline-fast-math-flags.ll
LLVM :: Transforms/Inline/inline_constprop.ll
LLVM :: Transforms/InstCombine/gep-combine-loop-invariant.ll
LLVM :: Transforms/InstCombine/gepphigep.ll
LLVM :: Transforms/InstMerge/st_sink_bugfix_22613.ll
LLVM :: Transforms/InstSimplify/phi.ll
LLVM :: Transforms/LICM/2003-05-02-LoadHoist.ll
LLVM :: Transforms/LoopInstSimplify/basic.ll
LLVM :: Transforms/LoopSimplify/ashr-crash.ll
LLVM :: Transforms/LoopSimplify/merge-exits.ll
LLVM :: Transforms/LoopUnroll/runtime-loop-multiple-exits.ll
LLVM :: Transforms/LoopUnroll/runtime-loop4.ll
LLVM :: Transforms/LoopUnroll/runtime-multiexit-heuristic.ll
LLVM :: Transforms/LoopUnroll/runtime-unroll-remainder.ll
LLVM :: Transforms/LoopUnroll/unroll-cleanup.ll
LLVM :: Transforms/LoopUnswitch/2007-07-12-ExitDomInfo.ll
LLVM :: Transforms/LoopUnswitch/unswitch-equality-undef.ll
LLVM :: Transforms/LoopVectorize/AMDGPU/packed-math.ll
LLVM :: Transforms/LoopVectorize/X86/already-vectorized.ll
LLVM :: Transforms/LoopVectorize/X86/float-induction-x86.ll
LLVM :: Transforms/LoopVectorize/X86/reg-usage.ll
LLVM :: Transforms/LoopVectorize/X86/small-size.ll
LLVM :: Transforms/LoopVectorize/X86/unroll-pm.ll
LLVM :: Transforms/LoopVectorize/first-order-recurrence.ll
LLVM :: Transforms/LoopVectorize/gcc-examples.ll
LLVM :: Transforms/LoopVectorize/if-conversion.ll
LLVM :: Transforms/LoopVectorize/opt.ll
LLVM :: Transforms/LoopVectorize/read-only.ll
LLVM :: Transforms/LoopVectorize/reduction.ll
LLVM :: Transforms/LoopVectorize/value-ptr-bug.ll
LLVM :: Transforms/LoopVectorize/vectorize-once.ll
LLVM :: Transforms/PhaseOrdering/reassociate-after-unroll.ll
LLVM :: Transforms/PhaseOrdering/simplifycfg-options.ll
LLVM :: Transforms/SimpleLoopUnswitch/2007-07-12-ExitDomInfo.ll
LLVM :: Transforms/SimpleLoopUnswitch/update-scev.ll
LLVM :: Transforms/ThinLTOBitcodeWriter/no-type-md.ll
alex-t abandoned this revision.Jun 20 2019, 6:35 AM