This is an archive of the discontinued LLVM Phabricator instance.

[dataflow] avoid more accidental copies of Environment
ClosedPublic

Authored by sammccall on Jun 21 2023, 8:19 PM.

Details

Summary

This is clunky but greatly improves debugging of flow conditions - each
copy adds more indirections in the form of flow condition tokens.

(LatticeEffect presumably once did something here, but it's now both
unused and untested.)

For the exit flow condition of:

void target(base::Optional<int*> opt) {
  if (opt.value_or(nullptr) != nullptr) {
    opt.value();
  } else {
    opt.value(); // unsafe
  }
}

Before:

(B0:1 = V15)
(B1:1 = V8)
(B2:1 = V10)
(B3:1 = (V4 & (!V7 => V6)))
(V10 = (B3:1 & !V7))
(V12 = B1:1)
(V13 = B2:1)
(V15 = (V12 | V13))
(V3 = V2)
(V4 = V3)
(V8 = (B3:1 & !!V7))
B0:1
V2

After D153491:

(B0:1 = (V9 | V10))
(B1:1 = (B3:1 & !!V6))
(B2:1 = (B3:1 & !V6))
(B3:1 = (V3 & (!V6 => V5)))
(V10 = B2:1)
(V3 = V2)
(V9 = B1:1)
B0:1
V2

After this patch, we can finally see the relations between the flow
conditions directly:

(B0:1 = (B2:1 | B1:1))
(B1:1 = (B3:1 & !!V6))
(B2:1 = (B3:1 & !V6))
(B3:1 = (V3 & (!V6 => V5)))
(V3 = V2)
B0:1
V2

(I believe V2 is the FC for the InitEnv, and V3 is introduced when
computing the input state for B3 - not sure how to eliminate it)

Diff Detail

Event Timeline

sammccall created this revision.Jun 21 2023, 8:19 PM
Herald added a project: Restricted Project. · View Herald Transcript
sammccall requested review of this revision.Jun 21 2023, 8:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2023, 8:19 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
sammccall updated this revision to Diff 533463.Jun 21 2023, 8:22 PM

refactor, restore comment

The idea with the change to Environment::join() is that the current signature forces the base environment to be mutable, but this sometimes forces the caller to make a copy, and the implementation actually copies everything anyway...

BTW i thought this was just nice for debugging and saving copying a few maps, but Dmitri says that the way even trivial indirections like (FC1 = FC2), (FC2 = ...) are expressed in the SAT solver means they can add significant cost there too...

This revision was not accepted when it landed; it landed in state Needs Review.Jun 26 2023, 6:58 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
cor3ntin added a subscriber: cor3ntin.EditedJun 26 2023, 7:23 AM

This seem to produce build failures when building clang-test

/home/cor3ntin/dev/compilers/LLVM/llvm-project/clang/unittests/Analysis/FlowSensitive/RecordOpsTest.cpp:50:27: error: calling a private constructor of class 'clang::dataflow::Environment'
        Environment Env = getEnvironmentAtAnnotation(Results, "p");
                          ^
/home/cor3ntin/dev/compilers/LLVM/llvm-project/clang/include/clang/Analysis/FlowSensitive/DataflowEnvironment.h:556:3: note: declared private here
  Environment(const Environment &) = default;
  ^
/home/cor3ntin/dev/compilers/LLVM/llvm-project/clang/unittests/Analysis/FlowSensitive/RecordOpsTest.cpp:112:27: error: calling a private constructor of class 'clang::dataflow::Environment'
        Environment Env = getEnvironmentAtAnnotation(Results, "p");
                          ^
/home/cor3ntin/dev/compilers/LLVM/llvm-project/clang/include/clang/Analysis/FlowSensitive/DataflowEnvironment.h:556:3: note: declared private here
  Environment(const Environment &) = default;
  ^

Sorry, this was a mid-air collision with f2123af1e7d75, fixed
in 1adc5a781faa

Sorry, this was a mid-air collision with f2123af1e7d75, fixed
in 1adc5a781faa

Thanks :)

Sorry, I got confused between branches and landed this one without approval

Going to review, review would still be appreciated

Sorry, going to revert, review would still be appreciated

Just FYI (since you mentioned it in your description): LatticeEffect is vestigial and should be removed. It's now only used (properly) for widening.

Regarding the design. This looks like an optimal solution in terms of copying but introduces lifetime risks and complexity that I'm uncomfortable with. There's a lot of flexibility here and I wonder if we can reduce that without (substantially?) impacting the amount of copying. Specifically, I wonder if we can have the builder *always* own an environment, which simplifies the handling of CurrentState and, generally, the number of cases to consider. It costs more in principle, but maybe negligible in practice?

clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
613

nit: i think you can now drop the braces.

clang/lib/Analysis/FlowSensitive/TypeErasedDataflowAnalysis.cpp
208

Is there some invariant that guarantees the safety here? Because const & can bind to a temporary, so this feels like a risky API.

sammccall marked an inline comment as done.Jun 26 2023, 12:01 PM

Regarding the design. This looks like an optimal solution in terms of copying but introduces lifetime risks and complexity that I'm uncomfortable with.

FWIW I agree with the complexity (not the lifetime risks).
I think it ultimately stems from having too many interacting types & functions with complicated and inconsistent signatures & constraints (TypeErasedDataflowAnalysisState, Environment, TypeErasedLattice, Lattice, the various join functions, etc) so we resort to case-bashing.
However this system is difficult to make changes to, and getting consensus as to what changes are good also doesn't seem easy. So I think we need to be able to fix bugs within the existing design (where copies need to be avoided ad-hoc).

There's a lot of flexibility here and I wonder if we can reduce that without (substantially?) impacting the amount of copying. Specifically, I wonder if we can have the builder *always* own an environment, which simplifies the handling of CurrentState and, generally, the number of cases to consider. It costs more in principle, but maybe negligible in practice?

I don't understand the specific suggestion:

  • if we create the builder lazily, we're just moving handling this extra state into the callsites
  • if the builder owns a copy of initenv, its FC token will end up in the result FC
  • if the builder owns a copy of initenv and detects when it should discard it, that just sounds like nullptr with extra steps

can you clarify?

clang/lib/Analysis/FlowSensitive/TypeErasedDataflowAnalysis.cpp
208

The invariant is that an external/unowned environment will outlive the joiner, which is just a temporary helper object.

Added comments to the class & CurrentState.

const & can bind to a temporary

given that the method name is addUnowned I think "i shouldn't pass null" is more important to communicate than "I shouldn't pass a temporary", though.

sammccall reopened this revision.Jun 26 2023, 12:03 PM

address comments

ymandel accepted this revision.Jun 26 2023, 12:44 PM

I don't want to block the progress you're making elsewhere and I think the concerns here are sufficiently localized to revisit another time. So, feel free to reify any suggestions/disagreements in FIXMEs and submit.

Regarding the design. This looks like an optimal solution in terms of copying but introduces lifetime risks and complexity that I'm uncomfortable with.

FWIW I agree with the complexity (not the lifetime risks).
I think it ultimately stems from having too many interacting types & functions with complicated and inconsistent signatures & constraints (TypeErasedDataflowAnalysisState, Environment, TypeErasedLattice, Lattice, the various join functions, etc) so we resort to case-bashing.
However this system is difficult to make changes to, and getting consensus as to what changes are good also doesn't seem easy. So I think we need to be able to fix bugs within the existing design (where copies need to be avoided ad-hoc).

I realize the complexity is frustrating, but I don't see how that's related to the issue here. The complexity of the JoinedStateBuilder is caused by the desire to minimize copies. The multiple cases come directly from the algorithm itself, most particularly that we may or may not be in an initial state, and the previous block may or may not have an environment we care about. At least, that is the specific issue I'm concerned with.

There's a lot of flexibility here and I wonder if we can reduce that without (substantially?) impacting the amount of copying. Specifically, I wonder if we can have the builder *always* own an environment, which simplifies the handling of CurrentState and, generally, the number of cases to consider. It costs more in principle, but maybe negligible in practice?

I don't understand the specific suggestion:

  • if we create the builder lazily, we're just moving handling this extra state into the callsites
  • if the builder owns a copy of initenv, its FC token will end up in the result FC
  • if the builder owns a copy of initenv and detects when it should discard it, that just sounds like nullptr with extra steps

can you clarify?

I suppose I'm thinking along the lines of:

  • if the builder owns a copy of initenv, its FC token will end up in the result FC

Can you expand on why this is a problem? That would help motivate the complexity. "join with InitEnv" should reduce to the identity function, so if its not, that feels like a problem we're sweeping under the rug.

This revision is now accepted and ready to land.Jun 26 2023, 12:44 PM

I don't want to block the progress you're making elsewhere and I think the concerns here are sufficiently localized to revisit another time. So, feel free to reify any suggestions/disagreements in FIXMEs and submit.

Regarding the design. This looks like an optimal solution in terms of copying but introduces lifetime risks and complexity that I'm uncomfortable with.

FWIW I agree with the complexity (not the lifetime risks).
I think it ultimately stems from having too many interacting types & functions with complicated and inconsistent signatures & constraints (TypeErasedDataflowAnalysisState, Environment, TypeErasedLattice, Lattice, the various join functions, etc) so we resort to case-bashing.
However this system is difficult to make changes to, and getting consensus as to what changes are good also doesn't seem easy. So I think we need to be able to fix bugs within the existing design (where copies need to be avoided ad-hoc).

I realize the complexity is frustrating, but I don't see how that's related to the issue here. The complexity of the JoinedStateBuilder is caused by the desire to minimize copies. The multiple cases come directly from the algorithm itself, most particularly that we may or may not be in an initial state, and the previous block may or may not have an environment we care about. At least, that is the specific issue I'm concerned with.

We have 9 state transitions (3 states x 3 operations). Each would be less than half as complicated if it didn't have to deal with Lattice's join being mutating and Environment's being non-mutating.

I believe further simplifications are possible (essentially: back at the callsite track vector<const Env*> All plus deque<Environment> Owned for owned copies, special case All.size() == 0 and All.size() == 1 and otherwise joinVec(All).
However with Lattice's mutating join involved this no longer works, and after addressing that it's no longer simpler.

There's a lot of flexibility here and I wonder if we can reduce that without (substantially?) impacting the amount of copying. Specifically, I wonder if we can have the builder *always* own an environment, which simplifies the handling of CurrentState and, generally, the number of cases to consider. It costs more in principle, but maybe negligible in practice?

I don't understand the specific suggestion:

  • if we create the builder lazily, we're just moving handling this extra state into the callsites
  • if the builder owns a copy of initenv, its FC token will end up in the result FC
  • if the builder owns a copy of initenv and detects when it should discard it, that just sounds like nullptr with extra steps

can you clarify?

I suppose I'm thinking along the lines of:

  • if the builder owns a copy of initenv, its FC token will end up in the result FC

Can you expand on why this is a problem? That would help motivate the complexity.

The purpose of this change is to stop introducing meaningless variables into the FC, because they impede debugging and slow down the SAT solver.
An extra join with a copy of InitEnv introduces up to 3 extra variables: one for InitEnv, one for the copy, one for the join.

"join with InitEnv" should reduce to the identity function, so if its not, that feels like a problem we're sweeping under the rug.

Only in a mathematical sense - in reality humans need to read these SAT systems, and we're using a solver that can't perform that reduction.

I realize the complexity is frustrating, but I don't see how that's related to the issue here. The complexity of the JoinedStateBuilder is caused by the desire to minimize copies. The multiple cases come directly from the algorithm itself, most particularly that we may or may not be in an initial state, and the previous block may or may not have an environment we care about. At least, that is the specific issue I'm concerned with.

We have 9 state transitions (3 states x 3 operations). Each would be less than half as complicated if it didn't have to deal with Lattice's join being mutating and Environment's being non-mutating.

I believe further simplifications are possible (essentially: back at the callsite track vector<const Env*> All plus deque<Environment> Owned for owned copies, special case All.size() == 0 and All.size() == 1 and otherwise joinVec(All).
However with Lattice's mutating join involved this no longer works, and after addressing that it's no longer simpler.

I'm pretty sure I agree on all of this, please document it somewhere (file an Issue in the github tracker?). FWIW, I think mutating join was a case of premature optimization gone bad, and ultimately the builtin lattice should not be built in at all -- it should exist outside the core and be just another client.

"join with InitEnv" should reduce to the identity function, so if its not, that feels like a problem we're sweeping under the rug.

Only in a mathematical sense - in reality humans need to read these SAT systems, and we're using a solver that can't perform that reduction.

I think we're talking past each other (and actually agreeing), sorry. To be clear, I don't mean that as a criticism of this change -- I think you've hit an important point and fixing it might make this simpler. I totally agree about SAT systems, and the solver, etc. But, I'm saying that it shouldn't come to that. I think the Environment join operation is broken if join(InitEnv, E) != E. I'm asking whether we can reduce (some) complexity by fixing that. From your earlier point, there are more problems, but I want to focus on this particular one in case solving it makes other things easier.

Thanks! I'll land this to eliminate the variables, and follow up with trying to simplify join().

I realize the complexity is frustrating, but I don't see how that's related to the issue here. The complexity of the JoinedStateBuilder is caused by the desire to minimize copies. The multiple cases come directly from the algorithm itself, most particularly that we may or may not be in an initial state, and the previous block may or may not have an environment we care about. At least, that is the specific issue I'm concerned with.

We have 9 state transitions (3 states x 3 operations). Each would be less than half as complicated if it didn't have to deal with Lattice's join being mutating and Environment's being non-mutating.

I believe further simplifications are possible (essentially: back at the callsite track vector<const Env*> All plus deque<Environment> Owned for owned copies, special case All.size() == 0 and All.size() == 1 and otherwise joinVec(All).
However with Lattice's mutating join involved this no longer works, and after addressing that it's no longer simpler.

I'm pretty sure I agree on all of this, please document it somewhere (file an Issue in the github tracker?). FWIW, I think mutating join was a case of premature optimization gone bad

Thanks - I will file a bug if it turns out not to be trivial, but I think it probably is trivial to fix so I'll give that a shot first.

and ultimately the builtin lattice should not be built in at all -- it should exist outside the core and be just another client.

Yeah, this looks like a bigger change :-)

"join with InitEnv" should reduce to the identity function, so if its not, that feels like a problem we're sweeping under the rug.

Only in a mathematical sense - in reality humans need to read these SAT systems, and we're using a solver that can't perform that reduction.

I think we're talking past each other (and actually agreeing), sorry. To be clear, I don't mean that as a criticism of this change -- I think you've hit an important point and fixing it might make this simpler. I totally agree about SAT systems, and the solver, etc. But, I'm saying that it shouldn't come to that. I think the Environment join operation is broken if join(InitEnv, E) != E. I'm asking whether we can reduce (some) complexity by fixing that. From your earlier point, there are more problems, but I want to focus on this particular one in case solving it makes other things easier.

Ah, got it, thanks!
I think if we can change Environment (e.g. dropping FC tokens) so this just falls out, and it makes sense for other reasons, that'd be great.
I'm not a big fan of special-casing InitEnv somehow though.