This is an archive of the discontinued LLVM Phabricator instance.

[clang][docs][dataflow] Added an introduction to dataflow analysis
ClosedPublic

Authored by gribozavr on Nov 19 2021, 3:00 AM.

Details

Summary

This documentation supports the dataflow analysis framework (see "[RFC]
A dataflow analysis framework for Clang AST" on cfe-dev).

Since the implementation of the framework has not been committed yet,
right now the doc describes dataflow analysis in general.

Since this is the first markdown document in clang/docs, I added support
for Markdown to clang/docs/conf.py in the same way as it is done in
llvm/docs.

Diff Detail

Event Timeline

gribozavr created this revision.Nov 19 2021, 3:00 AM
gribozavr requested review of this revision.Nov 19 2021, 3:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 19 2021, 3:00 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
sgatev added a subscriber: sgatev.Nov 19 2021, 4:10 AM

Thanks for working on this! Publishing documentation alongside the feature is a great practice. I liked this documentation a lot. I have strong opinion about the ambiguous use of path condition in this document, otherwise I was only nitpicking.

clang/docs/DataFlowAnalysisIntro.md
25

I know, everyone has their favorite set of resources. Let me share mine: https://cs.au.dk/~amoeller/spa/
I think this one might have a bit more content with more examples. Feel free to leave it as is.

50

Nit: I wonder if representing the value of x with an empty set is a good choice. One could argue that x has an indeterminate value that could be represented with the set of all integers using top. Others could argue for a bottom value. Those concepts are not yet introduced. Maybe initializing x in the source code would side step this problem.

72–73

Technically, I think this would be a join-semilattice. To have a lattice, we would need both join and meet.

87–88

If we want to be practical, I wonder if we want to give some guidance how to implement that efficiently. E.g. we could cite a paper like this: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.4911&rep=rep1&type=pdf

112–113

I wonder if we want to cite widening and narrowing in this context to give people some pointers where to look for more info.

183

I think it would make sense to include the whole program state in the comments. E.g. // x is ⊥, n is ⊤.
This could showcase how we constrain the value of n to 42 in one of the branches.

280

While I agree that this is the general formula, people sometimes do small variations, e.g.:

out =  join(transfer(basic_block,in_1), transfer(basic_block,in_2), ..., transfer(basic_block,in_n))

This is less efficient as we end up invoking the transfer function more often, but it can be more precise. E.g. with some ad-hoc notation:

Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
join(in...) : x : {⊤}
transfer("x/2", ...) == x : {⊤}

vs.
Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
transfer("x/2", ...) ==x : {0}, x : {1}, x : {1}, x: {2} == outs
join(outs) == x: {0, 1, 2}
289

Only if our transfer functions are monotonic :)

300

I think many people find it confusing, what is the relationship between symbolic execution and abstract interpretation? I found this answer helpful: https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation
I wonder if it might be worth citing.

366

Here, I think we want to include ptr in the comments describing the program state.

382–383

One could argue that you also cannot read from the pointee, otherwise you have an input-output parameter. There is an example clarifying this below, but I wonder if we want to get the definition right from the beginning.

470

While I do agree we cannot explain everything, like what borrowing or escaping means in this document, I wonder if we want to add some citations.

479–480

I wonder if the distinction between normal states and failure states is useful. In general, we can combine arbitrary number of lattices and propagate all the information in a single pass. I.e., we could have multiple "normal" or "failure" states.

There are multiple ways to combine lattices, we can put them on top of each other, or next to each other introducing new top/bottom values, or we can take their products.

571

It might be a good idea to reference liveness analysis which is often used to answer many related questions:

  • Finding dead stores
  • Finding uninitialized variable
  • Finding a good point to deallocate memory
  • Finding out if it would be safe to move an object
  • ...
809

Here, I think the term path condition becomes ambiguous. Traditionally, path condition is a path sensitive property, in the code snippet below:

if (b1)
{ ... }

if (b1 && b2)
{ ... }

We would reach the second branch on two paths (depending on whether we took the first one). On one of the paths, the path condition will have b1, on the other, it will have !b1.

I think here you mean the non-path-sensitive property. I'd suggest doing a strict distinction and use path condition for path sensitive properties and flow condition for flow sensitive properties.

jryans added a subscriber: jryans.Nov 24 2021, 5:34 AM
gribozavr updated this revision to Diff 390318.Nov 29 2021, 5:50 AM

Address review comments.

gribozavr2 added inline comments.
clang/docs/DataFlowAnalysisIntro.md
25

Forgot about this book -- I also like it a lot, added!

50

Since there is a discussion of these topics below, I agree that avoiding the question at this point is better. Added initialization.

72–73

Corrected.

87–88

Thanks, added.

112–113

I think it should be a separate section, with a slightly different example. To show the value of widening we need a lattice that can abstract the infinite set {0; 1; 2; ...} more precisely than "⊤" (e.g., "non-negative"). There would be also the usual handwaving about figuring out how to make those types of conclusions in the general case. I think adding all that here would break the flow and make the explanation less crisp. Adding a separate section on widening would be good one day, when the dataflow framework gets the feature. I don't think it is necessary right now.

183

So far we have been focusing on figuring out the values of x the local variable, not n the parameter, so I'd rather not deviate. Also, comparing n to 42 wouldn't necessarily lead to constraining the set of values here (for example, in the framework we do it in the flow condition).

280

This section is only attempting to give the readers an intuition of why dataflow works, hence it must be straightforward, not necessarily exhaustive or rigorous. I tried to add your suggestion here, but it breaks the flow. I tried to add it at the end of the section, but it looks out of place. So, sorry, I couldn't fit it here. If you have a specific wording suggestion that works here, I would be happy to apply it.

I think one would need to add a section like "variations on dataflow equations" to properly introduce the idea. It also seems to me that this specific variation is also a special case of a more general idea of deferring join operations to improve precision; that is, instead of doing a join everywhere where classic dataflow prescribes it, we can instead keep exploring separate paths, and only do a join at some later point in the CFG. Similarly we can also unroll loops in a precise way up to a certain number of iterations, and attempt to join/widen only at after that. These are of course important ideas, but they don't help with what this document is trying to achieve.

289

That's an important condition, added.

300

So far I have avoided using the term "abstract interpretation" in this document, or going into more deep topics. However I read this answer and I like it too, so I added a link to it here.

366

Added.

382–383

Corrected.

470

I reworded to avoid using the term "borrow".

479–480

I wonder if the distinction between normal states and failure states is useful.

I'm not sure I understand -- this distinction is useful in this particular approach to solve this problem, since it helps solve the problem? Or are you objecting to the term "failure"?

Of course, in general, an analysis does not need to have failure/normal states, and like you said, if we track information about multiple candidate output parameters at the same time, each can be in either a normal or failure state at every program point independently of other parameters. However, this document provides an example of a solution for this particular problem; the goal is not to solve the problem, but to give the reader an intuition of how dataflow ideas can be applied to solve real problems.

571

Good point, added a reference.

809

Replaced with "flow condition".

xazax.hun accepted this revision.Nov 29 2021, 11:52 AM

Thanks!

clang/docs/DataFlowAnalysisIntro.md
112–113

Sounds good.

183

I see. Yeah, if the framework does not do that (yet), I see how it might be confusing. Let's leave this as is.

280

Agreed, we cannot introduce everything in an introductory document. I mainly wanted to avoid creating the impression that there is one definitive way to data flow analysis and this is it. Maybe one sentence at the end mentioning there are several variations that we do not discuss here could be sufficient.

479–480

Yeah, I found the term failure a bit confusing. It is a regular analysis state, the analysis itself did not fail. I think I understand that this is named failure after the fact that it is blocking the transformation.

Probably my main problem was that it was not clear whether the document suggested having normal and failure states is a general design principle to all analysis or specific to the problem we want to solve. I think in this particular example the idea of combining lattices in a certain way to build larger lattices is the general design principle (combining something that one could think of 3 separate analysis: escape, unsafe reads, and field sensitive initialized variable into one) and the classification of states into normal and failure is specific to the problem.

I think the placement of this section makes it pretty clear that this is just an example and not describing general methods. Probably I took a break before reading this section and read it with the wrong context in mind :)

Feel free to leave this as it is.

This revision is now accepted and ready to land.Nov 29 2021, 11:52 AM
gribozavr2 added inline comments.Dec 6 2021, 2:59 AM
clang/docs/DataFlowAnalysisIntro.md
280

Added a long parenthetical:

(Note that there are other ways to write this equation that produce higher
precision analysis results. The trick is to keep exploring the execution paths
separately and delay joining until later. Hoowever, we won't discuss those
variations here.)

479–480

Sounds good :)

Did you add somewhere what dependency we need in order to build the documentation now as mine is having problems with "recommonmark"

/usr/bin/sphinx-build -n ./docs ./html
Running Sphinx v3.4.3

Configuration error:
There is a programmable error in your configuration file:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/sphinx/config.py", line 326, in eval_config_file
    execfile_(filename, namespace)
  File "/usr/lib/python3.6/site-packages/sphinx/util/pycompat.py", line 88, in execfile_
    exec(code, _globals)
  File "/buildareas/llvm3/llvm-project/clang/docs/conf.py", line 41, in <module>
    import recommonmark
ModuleNotFoundError: No module named 'recommonmark'
Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2022, 9:33 AM

Did you add somewhere what dependency we need in order to build the documentation now as mine is having problems with "recommonmark"

I didn't change the documentation because recommonmark is not a new dependency for llvm-project, llvm/docs was depending on it before this change.