Page MenuHomePhabricator

Allow non-variadic functions to be attributed with `__attribute__((format))`
Needs ReviewPublic

Authored by fcloutier on Oct 26 2021, 3:35 PM.

Details

Summary

Clang only allows you to use __attribute__((format)) on variadic functions. There are legit use cases for __attribute__((format)) on non-variadic functions, such as:

(1) variadic templates

template<typename… Args>
void print(const char *fmt, Args… &&args) __attribute__((format(1, 2))); // error: format attribute requires variadic function

(2) functions which take fixed arguments and a custom format:

void print_number_string(const char *fmt, unsigned number, const char *string) __attribute__((format(1, 2)));
// ^error: format attribute requires variadic function

void foo(void) {
    print_number_string(“%08x %s\n”, 0xdeadbeef, “hello”);
    print_number_string(“%d %s”, 0xcafebabe, “bar”);
}

This change allows Clang users to attach __attribute__((format)) to non-variadic functions, including functions with C++ variadic templates. It replaces the error with a GCC compatibility warning and improves the type checker to ensure that received arrays are treated like pointers (this is a possibility in C++ since references to template types can bind to arrays).

Diff Detail

Event Timeline

fcloutier created this revision.Oct 26 2021, 3:35 PM
fcloutier requested review of this revision.Oct 26 2021, 3:35 PM

Thank you for this! I have some concerns that I'd like to talk out to hopefully make myself feel more comfortable with this.

My understanding of the utility from -Wformat warnings comes from the fact that the function receiving the format string has no way to validate that the variadic arguments passed have the correct types when compared to the format string. However, in both of the new cases you propose to support, that type information is present in the function signature (you may have to go through a bunch of template instantiations in the variadic template cases, but you get there eventually). Having the types in the signatures changes something I think might be pretty fundamental to the way the format string checker works -- it tries to figure out types *after default argument promotions* because those promotions are a lossy operation. However, when the types are specified in the signature, those default argument promotions no longer happen. The type passed for a %f may actually be a float rather than promoted to a double, the type for a char may actually be char rather than int, etc. I'm not convinced this is as simple as "now we allow non-vararg functions" for the implementation. I think we need some more extensive testing to prove that the diagnostics actually make sense or not.

In terms of the specific cases allowed, I think I am happier about variadic templates than I am about fixed-signature functions. For the variadic template, I think supporting -Wformat would mean users don't have to reimplement similar logic to what that check already handles even though they can do it themselves. But for a fixed-signature function, I'm not certain I see what the value add is -- the only valid format strings such a function could accept would have to be fixed to the function signature anyway. Can you explain what use cases you have for that situation?

Thanks for looking, Aaron. You're right that the main utility of the aggregation of format warnings is to extend C's type checking because there is no other good way, or good place, to do it. I have built hundreds of millions of shipping lines of C, C++ and Objective-C, and this change seems like it would be an effective fix in several places where we don't currently have anywhere else to go.

For variadic templates, you're right that at some final instantiation, the compiler will have all the format argument types in hand. What gets lost in piping and substitution is actually the format string that Clang must type-check against. You can see this in action in the LLVM code base, actually: if we turn on -Wformat-nonliteral for files including llvm/include/Support/Format.h, the warning will trigger for users of the llvm::format function. This is because the format string is stored in format_object_base::Fmt instead of being directly forwarded, and sprinkling constexpr in strategic places won't resolve this issue because SemaChecking has its own custom expression evaluator. Swapping it out for the more common ExprConstant stuff is probably not impossible, but it's difficult because SemaChecking supports its own kind of symbolism. For instance, it's OK to use a format string parameter as the format string argument of another function that wants one, and other attributes like format_arg can tell you to assume the same specifiers as you get from another expression. So, we could fix this, but it would be more work, and the same purpose is served by allowing fixed arguments to participate in the format attribute checking. Verifying that users of the LLVM llvm::format function are doing the right thing is a better experience than letting this bubble up from however many levels of template instantiation there is.

Type checking may well actually need to be tested better for the case where variadic argument promotions don't happen, but there's a fairly finite number of ways it could go wrong, and I still think that's the easier way to go.

I don't think that we can easily distinguish between parameters created by variadic templates and fixed parameters from a function without variadic templates. I also think that most people who write functions like llvm::format aggressively do not want to be in charge of type-checking the format string themselves, and would much rather defer to Clang's existing type checker. If Clang allows the format attribute to type-check parameters created by variadic templates, it follows that the path of least resistance is to allow it on functions with fixed arguments.

With that said, there are still use cases for allowing format strings in functions with fixed arguments. (Interestingly, it would not be possible to mix fixed format arguments with variadic format arguments in a C function, so maybe we can prevent that altogether, but it does not remove from the general usefulness of the feature). You'll have to take my word for this, but it's one of the handful of blockers that we have for very broad adoption of -Wformat-nonliteral. The story usually goes like this: there's some function NSString *foo(NSString *fmt, NSMyFoo *obj1, NSMyBar *obj2) that needs to do something with obj1 and obj2 before it decides whether it actually wants to print them, return something unrelated altogether, throw an exception, etc. General limitations of the C language make it difficult to untangle the print parts from the logic parts for reasons which, to me, align rather closely to compiler capriciousness given how close it is to supporting the feature. This is more common in Objective-C code bases as %@ is a universal format specifier for all object types.

Another way we could approach this problem is to have a new attribute to say that a format argument must have a string that conforms to another constant format string. For instance, I think that NSString *foo(NSString *fmt, NSMyFoo *obj1, NSMyBar *obj2) __attribute__((format_like(1, @"hello %@ %@!"))), by the compiler checks that fmt has specifiers equivalent to %@ %@, would give us roughly the same safety improvements. It would allow for even more fun things, like passing a format string to a function that will then itself supply arguments to a formatting function.

What do you think?

[...] Having the types in the signatures changes something I think might be pretty fundamental to the way the format string checker works -- it tries to figure out types *after default argument promotions* because those promotions are a lossy operation. However, when the types are specified in the signature, those default argument promotions no longer happen. The type passed for a %f may actually be a float rather than promoted to a double, the type for a char may actually be char rather than int, etc.

True, and I think you're right that this change needs some really exhaustive tests. First, notice that the point of format(printf) is still ultimately to pass the arguments to some printf-like varargs function that will do the default argument promotions. So if our argument type is float, well, in the non-pathological cases, eventually it should become a double (when it is finally passed to printf, possibly several levels deeper in the function call stack). But second, consider situations like

void myprintf(const char *fmt, int n) __attribute__((format(printf, 1, 2)));  // N.B.: int, not unsigned long
int main() {
    myprintf("%lu", 3uL);  // this should error
    myprintf("%d", 3uL);  // this should not error
}

In terms of the specific cases allowed, I think I am happier about variadic templates than I am about fixed-signature functions. [...] For a fixed-signature function, I'm not certain I see what the value add is -- the only valid format strings such a function could accept would have to be fixed to the function signature anyway.

Or, in some situations (like logging), the format string could be some weird amalgam of the signature and something else, right?

void mydebug(const char *fmt, int line) { printf(fmt, "test.cpp", line); }
int main() {
    mydebug("%s:%d", 42);  // printf("%s:%d", 42) would not be correct; but the way mydebug ultimately uses printf, this is actually safe and intentional
}

When we receive a va_list or a varargs ..., this scenario doesn't arise (AFAIK), because there is no way to manipulate or insert-more-arguments-into a va_list after the fact.

Initially I thought this PR was a slam-dunk, but thinking about all the pathological-but-possible corner cases here does make me more skeptical now.

Thanks Arthur for your feedback.

void myprintf(const char *fmt, int n) __attribute__((format(printf, 1, 2)));  // N.B.: int, not unsigned long
int main() {
    myprintf("%lu", 3uL);  // this should error
    myprintf("%d", 3uL);  // this should not error
}

This is handled naturally by the current implementation. The integer literal undergoes an implicit cast to int because that's the type of the n parameter, and it causes the %lu case to fail and the %d case to succeed.

Your second example is the scenario of concern for adding an attribute like the format_like attribute that I described in my response to Aaron. I think that these two features don't need to be tied together.

Thanks for looking, Aaron.

Thank you for the detailed response!

Type checking may well actually need to be tested better for the case where variadic argument promotions don't happen, but there's a fairly finite number of ways it could go wrong, and I still think that's the easier way to go.

I think this is the current thing for us to test and try out. It shouldn't be impossible to do somewhat exhaustive testing (there's only so many format specifiers and length modifiers to worry about), but it's those edge cases that have me worried.

I don't think that we can easily distinguish between parameters created by variadic templates and fixed parameters from a function without variadic templates.

Agreed. This should be checking the instantiations, so by that point, the variadic template is really more like a fixed parameter list anyway.

I also think that most people who write functions like llvm::format aggressively do not want to be in charge of type-checking the format string themselves, and would much rather defer to Clang's existing type checker. If Clang allows the format attribute to type-check parameters created by variadic templates, it follows that the path of least resistance is to allow it on functions with fixed arguments.

Also agreed.

With that said, there are still use cases for allowing format strings in functions with fixed arguments. (Interestingly, it would not be possible to mix fixed format arguments with variadic format arguments in a C function, so maybe we can prevent that altogether, but it does not remove from the general usefulness of the feature). You'll have to take my word for this, but it's one of the handful of blockers that we have for very broad adoption of -Wformat-nonliteral. The story usually goes like this: there's some function NSString *foo(NSString *fmt, NSMyFoo *obj1, NSMyBar *obj2) that needs to do something with obj1 and obj2 before it decides whether it actually wants to print them, return something unrelated altogether, throw an exception, etc. General limitations of the C language make it difficult to untangle the print parts from the logic parts for reasons which, to me, align rather closely to compiler capriciousness given how close it is to supporting the feature. This is more common in Objective-C code bases as %@ is a universal format specifier for all object types.

Another way we could approach this problem is to have a new attribute to say that a format argument must have a string that conforms to another constant format string. For instance, I think that NSString *foo(NSString *fmt, NSMyFoo *obj1, NSMyBar *obj2) __attribute__((format_like(1, @"hello %@ %@!"))), by the compiler checks that fmt has specifiers equivalent to %@ %@, would give us roughly the same safety improvements. It would allow for even more fun things, like passing a format string to a function that will then itself supply arguments to a formatting function.

What do you think?

I think the second option may be an idea worth exploring, but if we can avoid adding another attribute, that's usually preferred. I think that if we add sufficient test coverage, it'd make sense to use the existing attribute. As Arthur pointed out, ultimately this stuff is expected to be passed to printf (et al) and so long as the attribute continues to honor pointing out issues with that goal, I think it's a reasonable one to use in these situations. I think we only need to consider a secondary attribute if we find that the semantics we need are sufficiently different to warrant it.

Agreed. This should be checking the instantiations, so by that point, the variadic template is really more like a fixed parameter list anyway.

FWIW, in my own mental model, there's a big semantic difference between (varargs functions, variadic templates) on the one hand and (non-template functions) on the other. In my experience, there's nothing you can do with a varargs ellipsis except forward it along as a va_list; and it's uncommon to do anything with a variadic parameter pack except forward it along via std::forward; but messing with fixed arguments is quite common. Technically, you can mess with a parameter pack too:

void apple(const char *fmt, int x, int y) __attribute__((format(printf, 1, 2))) {
    printf(fmt, double(x), double(y));
}
void banana(const char *fmt, auto... args) __attribute__((format(printf, 1, 2))) {
    printf(fmt, double(args)...);
}
int main() {
    apple("%g %g", 17, 42);  // well-defined; shall we warn anyway? (My gut feeling is that this is relatively common)
    banana("%g %g", 17, 42);  // well-defined; shall we warn anyway? (My gut feeling is that this is extremely rare)
}

This morning I am leaning in favor of this PR as written. If a programmer wants apple/banana to be callable like that, without any diagnostics, then their appropriate course of action is simply not to apply the __attribute__((format(printf, 1, 2))) annotation.

clang/include/clang/Basic/DiagnosticSemaKinds.td
4051

I'd say with the %0 attribute (add "the")

clang/lib/AST/FormatString.cpp
325–329

Also, function references will decay to function pointers. I have no idea if you need to do anything special here to get the "right behavior" for function references. But please add a (compile-only?) test case for the function-pointer codepath, just to prove it doesn't crash or anything.

clang/test/Sema/attr-format.cpp
5–7

Can we also do an example of a member function variadic template?

struct S {
  template<class... Args>
  void format(const char *fmt, Args&&... args)
    __attribute__((format(printf, 2, 3)));
};

Also, I believe that this entire file should be removed from Sema/ and combined into SemaCXX/attr-format.cpp.

I also notice that we have literally zero test coverage for cases where the format string is not the first argument to the function; but that can-and-should be addressed in a separate PR.

14

Basically, add , do_format here.
(Aside: I'm surprised that Clang quietly lets you print a function pointer with %p. It'll work on sane architectures, but in general C and C++ don't require that function pointers even be the same size as void* — technically this should be UB or at least impl-defined behavior.)