This is an archive of the discontinued LLVM Phabricator instance.

[ast-matchers] Add hasSubstatementSequence matcher
AbandonedPublic

Authored by LegalizeAdulthood on Jan 2 2022, 9:19 PM.

Details

Reviewers
klimek
Summary

This allows the matching of two specific statements in sequence within
a compound statement. For instance if (x) return true; return false;
can be matched as

compoundStmt(hasSubstatementSequence(ifStmt(), returnStmt()))

Diff Detail

Event Timeline

LegalizeAdulthood requested review of this revision.Jan 2 2022, 9:19 PM
LegalizeAdulthood created this revision.

Improve matching when the first matcher matches multiple times,
but the 2nd matcher doesn't match at first.

aaron.ballman added inline comments.
clang/include/clang/ASTMatchers/ASTMatchers.h
5434–5441

How do we extend this to support testing arbitrary sequences of statements? (If this supports two statements, someone will find a need for three, etc).

clang/unittests/ASTMatchers/ASTMatchersNarrowingTest.cpp
2636

Some more complex-ish tests: can we match within function try or catch blocks and GNU statement expressions?

LegalizeAdulthood marked an inline comment as done.Jan 3 2022, 11:44 AM
LegalizeAdulthood added inline comments.
clang/include/clang/ASTMatchers/ASTMatchers.h
5434–5441

Yeah, I was wondering that too. I didn't see (but didn't look extensively) any support for variadic matchers taking a parameter pack.

I stopped at 2 because this suits my immediate needs with readability-simplify-boolean-expr where I have to manually loop over CompoundStmt matches in order to verify that the if (x) return true; return false; constructs consist of two adjacent statements.

clang/unittests/ASTMatchers/ASTMatchersNarrowingTest.cpp
2636

Adding something for GNU statement expressions is a good idea.

I'm open for other suggestions that might improve coverage.

At the moment, I believe I have 100% coverage of the new matcher, with the exception of GNU statement expressions (honestly, I just copy/pasted that from another matcher on CompoundStmt, I wasn't even aware of this GNU extension).

LegalizeAdulthood marked an inline comment as done.

Add stmtExpr test case

LegalizeAdulthood marked an inline comment as done.Jan 3 2022, 12:09 PM
aaron.ballman added inline comments.Jan 4 2022, 2:09 PM
clang/include/clang/ASTMatchers/ASTMatchers.h
5434–5441

I don't think we have any variadic matchers yet to model on, but I think if this is only needed for one check, we can add it in the current form as a local matcher for that check. Whenever we figure out how to give the better interface through the static and dynamic matchers, then we can figure out how to lift this to the general matcher interface.

WDYT?

clang/unittests/ASTMatchers/ASTMatchersNarrowingTest.cpp
2636

I just double-checked and we do model function try blocks as a body, and we model function body as being a CompoundStmt, so I think we're good enough there. I was more worried we did something odd with those (they're pretty easy to forget about given how uncommon they are).

LegalizeAdulthood marked an inline comment as done.Jan 4 2022, 6:10 PM
LegalizeAdulthood added inline comments.
clang/include/clang/ASTMatchers/ASTMatchers.h
5434–5441

I don't think it is harmful to make it visible to all and I think it is helpful.
Defining it in ASTMatchers, enables using it in clang-query, for instance.

LegalizeAdulthood marked 2 inline comments as done.Jan 4 2022, 8:30 PM
aaron.ballman added inline comments.Jan 5 2022, 4:23 AM
clang/include/clang/ASTMatchers/ASTMatchers.h
5434–5441

I contend this is not a generally useful matcher without supporting an arbitrary number of statements. Even then, to be honest, it's questionable whether there's sufficient need for this to be a public matcher. Typically, we don't expose a new public matcher unless there's a general need for it, and this one is already pretty borderline even if it's fully generalized. This file is *very* expensive to instantiate and it gets used in a lot of places, so that's one of the primary reasons we don't expose matchers from here unless they're generally useful.

Unless @klimek or another AST matcher code owner thinks this is useful in general (to multiple checks people are likely to want to write even if they're not contributing the checks to the community), I'm opposed to exposing this as-is. However, adding it as a private matcher for the check that needs the limited functionality would get no opposition from me (and this implementation looks correct as well).

clang/include/clang/ASTMatchers/ASTMatchers.h
5434–5441

My thoughts:

I'm OK with moving it as a private matcher as it would simplify a big chunk of code
in the simplify-boolean-expr check.

I agree that ASTMatchers.h is used all over the place and at a minimum causes a
huge amount of rebuild.

Regarding the general usefulness of the matcher, let me elaborate on my motivation
for adding this matcher.

I see it from the viewpoint of a developer of checks/refactorings.

I really want us to get to a world where a complete refactoring can be specified as
a script input to a refactoring tool. Eliminating the need to write C++ that directly
manipulates the AST and the edits will lower the bar for entry for other people
writing automated changes to their codebases.

Since the last time I worked in the clang codebase, the transformer library has
been added. With this new library, I think all we're missing is a parser that matches
the input script to the necessary calls to code in the transformer library. Once we
do this, I think the need for more and higher-level matchers will become evident
and a matcher like this is the only way you can specify that a statement Y
immediately follows statement X purely through matchers. Current matchers
don't even let you specify relative ordering of statements. The best you can do
is assert that a block contains the statements of interest and then you must write
your own C++ code to walk the block and determine if they fit your actual
match criteria.

clang/include/clang/ASTMatchers/ASTMatchers.h
5434–5441

Also, regarding a variadic version of this matcher, I'd be curious to try it out
just from a learning/programming perspective, but I'm not sure how I'd go about
it. Suggestions on a plan of attack would be most welcome! :)

aaron.ballman added inline comments.
clang/include/clang/ASTMatchers/ASTMatchers.h
5434–5441

I'm OK with moving it as a private matcher as it would simplify a big chunk of code in the simplify-boolean-expr check.

Thank you!

Regarding the general usefulness of the matcher, let me elaborate on my motivation for adding this matcher.

Thanks for sharing this, and FWIW, I agree with what you're saying. I'm curious to see how Stencil (CC @ymandel) changes the landscape in this area, but I think there's still room for improvement in the AST matchers as well. If we found a way to fully generalize the matcher from here, I'd be happy to have it as a first-class AST matcher. My only concern with the current form is that it's awkward for cases where you want three or more statements in a row. For example, you might want to find an old-school swap implementation and replace it with a call to std::swap(), and one form of that would look for three statements in a row.

5434–5441

I think mapAnyOf() is a fully variadic matcher that even has dynamic matcher support, so that may be worth looking at. The traverse() match is definitely fully variadic, but it has no dynamic matcher support. FWIW, I think it's acceptable to add the matcher even if we can't figure out the dynamic interface in the initial patch, but it'd be really nice if we could make the dynamic matcher work in the initial offering because that means it's a usable feature within clang-query (which is how a lot of folks explore matcher behavior when writing tidy checks).

clang/include/clang/ASTMatchers/ASTMatchers.h
5434–5441

I'm curious to see how Stencil changes the landscape

Yes, I've been talking with Yitzhak over email about this. As we move
more towards something like Stencil encompassing more checks, I
think we'll find additional matchers that are missing to enable Stencil
for various cases. I've already had to add a bunch of matchers for
the checks I've written and they're not particularly complicated. The
existing set of matchers are driven from need, so as new needs arise
I expect new matchers to arise.

For matchers to be usable by any sort of scripted approach, they have
to be registered with the parsing framework. I am not sure if such
registration is an internal detail or if you can register your own custom
matchers with the parsing framework.

I investigated making this a variadic matcher and I'm going to have
to spend more time understanding the matcher framework implementation
before I'm going to be able to implement it. There's quite a bit of
machinery under the hood.

Thanks for looping me in. I'll try to take a detailed look later today. In the meantime, I'll note that we have something similar internally which I never got around to upstreaming. However, we chose to support arbitrarily many matchers, with this interface:

const clang::ast_matchers::internal::VariadicFunction<
    clang::ast_matchers::internal::Matcher<clang::CompoundStmt>,
    internal::SequenceMatcher<clang::Stmt>,
    internal::hasSubstatementSequenceFunc>
    hasSubstatementSequence = {};

The SequenceMatcher API is:

// The following definitions all support the `hasSubstatementSequence`
// matcher. This matcher supports describing the series of statements in a
// compound statement, in a style inspired by regular expressions. Unlike
// regular expressions, however, these operators are deterministic. Choices are
// tried in order. For optional-style operators (`maybeOne`, `zeroOrMore` and
// `oneOrMore`) the positive choice is considered first.
template <typename T>
internal::SequenceMatcher<T> exactlyOne(
    clang::ast_matchers::internal::Matcher<T> Matcher) {
  return {std::move(Matcher)};
}

template <typename T>
internal::SequenceMatcher<T> maybeOne(
    clang::ast_matchers::internal::Matcher<T> Matcher) {
  return {internal::SequenceElementKind::ZeroOrOne, std::move(Matcher)};
}

template <typename T>
internal::SequenceMatcher<T> zeroOrMore(
    clang::ast_matchers::internal::Matcher<T> Matcher) {
  return {internal::SequenceElementKind::ZeroOrMore, std::move(Matcher)};
}

template <typename T>
internal::SequenceMatcher<T> oneOrMore(
    clang::ast_matchers::internal::Matcher<T> Matcher) {
  return {internal::SequenceElementKind::OneOrMore, std::move(Matcher)};
}

I also implemented a non-deterministic version using backtracking. But, that scared me off because of its potential cost (potential, b/c we could use memoization like packrat parsing to avoid the exponential).

That said, my experience indicates that once you're thinking in terms of sequences, you're probably going to find that you want to match over the CFG, rather than the AST directly.

[...] In the meantime, I'll note that we have something similar internally
which I never got around to upstreaming.

That's very interesting and is certainly more general than what I have
which is just scratching my own immediate itch. I got lost in the variadic
functor/matcher infrastructure last night when I tried to reverse engineer
it to try and generalize what I had.

That said, my experience indicates that once you're thinking in terms
of sequences, you're probably going to find that you want to match
over the CFG, rather than the AST directly.

Is there any tooling for matching the CFG currently? I haven't come across
it but the code base is large and there's lots of stuff I haven't personally
browsed.

Sorry for the delay. Read through the whole thread today. I agree with Aaron -- we shouldn't add this until

  1. we have a general interface
  2. we're sure this serves a general audience

As I mentioned in my previous reply, I have a working general interface that I shared and whose implementation I'd be happy to upstream. But, that brings us to point 2, for which I'm on the fence. I think it serves a general audience but I also suspect its really just a frustrating "not quite good enough" solution.

To the point of opening up matchers to a wider audience and the cost of ASTMatchers.h -- this sounds like a broader issue that we should find a way to address, well beyond this patch. Happy to be in on that discussion, but I don't think I have the cycles now to drive it. :)

Regarding CFG matchers -- none that I know of. We've had it as project on our "interesting clang projects" list (in our heads) for a while, but have yet to invest in it. What we *have* started is a dataflow analysis framework: clang/Analysis/FlowSensitive (see the unittests for some example uses). My main use of hasSubstatementSequence was to find pointer variables which could be safely converted to unique_ptr. I converted that into a dataflow analysis built on the new framework and it's a lot cleaner.

Clarification: I'm totally fine with it existing privately, my concerns were with a public implementation.

OK, I'm going to migrate this to a private matcher in the check
that needs it.

A related question is if it is possible to register private matchers
with the parsing framework?

Based on discussion, I'm going to move this as a private matcher
in the check where I intend to use it.

Therefore, I'm abandoning this review.

I look forward to seeing Yitzhak's generalized matcher :)