This is an archive of the discontinued LLVM Phabricator instance.

Reimplement `__builtin_dump_struct` in Sema.
ClosedPublic

Authored by rsmith on Apr 21 2022, 5:52 PM.

Details

Summary

Compared to the old implementation:

  • In C++, we only recurse into aggregate classes.
  • Unnamed bit-fields are not printed.
  • Constant evaluation is supported.
  • Proper conversion is done when passing arguments through ....
  • Additional arguments are supported and are injected prior to the format string; this directly supports use with fprintf, for example.
  • An arbitrary callable can be passed rather than only a function pointer. In particular, in C++, a function template or overload set is acceptable.
  • All text generated by Clang is printed via %s rather than directly; this avoids issues where Clang's pretty-printing output might itself contain a % character.
  • Fields of types that we don't know how to print are printed with a "*%p" format and passed by address to the print function.
  • No return value is produced.

Diff Detail

Event Timeline

rsmith created this revision.Apr 21 2022, 5:52 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 5:52 PM
rsmith requested review of this revision.Apr 21 2022, 5:52 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 5:52 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
rsmith edited the summary of this revision. (Show Details)Apr 21 2022, 5:52 PM
rsmith added a comment.EditedApr 21 2022, 6:00 PM

I'm not entirely sure whether I want to pursue this -- I'd prefer to have only a single mechanism that works well in both C and C++, rather than this (which is flexible but not really usable in C) and __builtin_dump_struct (which is inflexible but works well for C) -- but I've not found a good way to expose this functionality that provides good support for both use cases yet. Thoughts on that are welcome.

This came from wanting to write up how I would have like to have seen __builtin_dump_struct be implemented. Note in particular that all of the work is done in Sema with proper semantic checking, rather than in CodeGen, which avoids the problems that __builtin_dump_struct has with converting arguments to the right type, provides a lot more flexibility in the function argument, and supports constant evaluation for free.

This is somewhat WIP: in particular, access checking is not done consistently. I'm not sure whether we should ignore access control or properly enforce it; the derived-to-base conversions currently enforce it and the field accesses ignore it, which is certainly not the right answer.

rsmith updated this revision to Diff 424350.Apr 21 2022, 6:14 PM
  • Some improvements.
rsmith updated this revision to Diff 424379.Apr 21 2022, 10:47 PM
  • Fix some code that was making incorrect assumptions that PseudoObjectExpr is only used for ObjC properties.

I generally just question the usefulness of this. Despite its shortcomings, the 'dump struct' has the capability of runtime introspection into a type, whereas this seems to require that the user 'know' a significant amount about the type that they are introspecting (number of bases, number of fields, field TYPES). If you knew the type of your struct, THIS well, I would expect you wouldn't NEED a builtin like this.

clang/docs/LanguageExtensions.rst
2472

I'm curious as to the need for the 'arg1' and 'arg2'? Also, what is the type of the 'fields' array that works in C?

2475

Hmm... what is the type of f.bitwidth? A zero-length bitwidth is a valid thing, so having it just be an 'int' doesn't seem workable.

2498

While I get the value of a generic reflection 'thing', this is once again getting awful close to 'just implement reflection capabilities', at which point this seems insufficient.

2500

Is the purpose of the overload set/functor so you can support multiple types (seeing how the 'base' below works)? How does this all work in C? (or do we just do fields in C, since they cannot have the rest?

I wonder if we would be better if we treated this function as a 'callback' or a 'callback set' that returned a discriminated union of a pre-defined type that describes what is being returned. That way it could handle member functions, static variables, etc.

2502

Same question about init-list, how is this usable in C? It becomes a BIT more usable in C if instead we return a single object that represents a single AST node.

2509

I find myself wishing this was a more complete list.

2511

I wonder if bases should include their virtualness or access specifier.

clang/lib/Sema/SemaChecking.cpp
434

Losing the unnamed bitfield is unfortunate, and I the 'dump struct' handles. As is losing the anonymous struct/union.

Also, how does this handle 'record type' members? Does the user have to know the inner type already? If they already know that much information about the type, they wouldn't really need this, right?

clang/test/CodeGenCXX/builtin-reflect-struct.cpp
24 ↗(On Diff #424379)

OH! I see now. This is unfortunately quite limiting by doing it this way. At this point the user of this builtin is required to know the layout of the struct before calling the builtin, which seems unfortunate. Despite the downsides of being just a 'dump function', the other builtin had the benefit of working on an arbitrary type.

I generally just question the usefulness of this. Despite its shortcomings, the 'dump struct' has the capability of runtime introspection into a type, whereas this seems to require that the user 'know' a significant amount about the type that they are introspecting (number of bases, number of fields, field TYPES).

I think you've misunderstood; that is not required. Though with the design as-is, it might make sense to restrict this to being C++-only, given that there's not really a way to effectively use this from C.

clang/docs/LanguageExtensions.rst
2472

This is aimed to fix a major deficiency in __builtin_dump_struct, that there's no way to pass information into its callback (other than stashing it in TLS).

As noted in the patch description, this builtin is not especially usable in C. I'm not sure whether that's fixable, given the very limited metaprogramming support in C.

2475

It's a size_t. There are no zero-size bitfield members. (Unnamed bitfield aren't members, so are excluded here.) In any case, you can tell the difference between a bitfield and a regular member by the length of the initialiser list; we don't pass a bit width for non-bit-field members.

2500

This doesn't really work well in C, as noted in the patch description. A set of callbacks might work a bit better across both languages, but I don't think it solves the larger problem that there's no good way to pass type information into a callback in C.

2509

What else do you want? My goal here was to be able to do what __builtin_dump_struct does but without its many arbitrary limitations. From C++, this is enough for that.

2511

I mean, maybe, but I don't have a use case for either. For access in general my leaning is to say that the built-in doesn't get special access rights -- either it can access everything or you'll get an access error if you use it -- in which case including the access for bases and fields doesn't seem likely to be super useful.

clang/lib/Sema/SemaChecking.cpp
434

If __builtin_dump_struct includes unnamed bitfields, that's a bug in that builtin.

In general, the user function gets given the bases and members with the right types, so they can use an overload set or a template to dispatch based on that type. See the SemaCXX test for a basic example of that.

clang/test/CodeGenCXX/builtin-reflect-struct.cpp
24 ↗(On Diff #424379)

Well, no, this is set up this way to test code generation. See the SemaCXX test for an example of using this to dump an arbitrary type.

I think you've misunderstood; that is not required. Though with the design as-is, it might make sense to restrict this to being C++-only, given that there's not really a way to effectively use this from C.

Ah, I see the example in SemaCXX. its sad we can't do this from C, I would consider the use case of __builtin_dump_struct to be very handy in C.

I guess I think I would want to start off with, "what do we see the 'usecase' of these builtins" being. __builtin_dump_struct to me seems to have a pretty clear and semantically enforced one: its useful to printf it for debugging purposes.

This seems to be more useful-enough that it leaves me feeling like users would very quickly fall into, "man, I wish this did more" because of it. Its just a shame we can't get SG7 to work quicker :)

clang/docs/LanguageExtensions.rst
2472

Presumably one could also put it in the functor for the callback, or even as a capture in the lambda. Though, I'm not positive I see the VALUE to passing this information, but was just curious as to the intent.

2475

There are no zero-size bitfield members. (Unnamed bitfield aren't members, so are excluded here.)

Hrm... Semantically YES, lexically I'm not sure folks make that differentiation. One issue with skipping 'unnamed bitfields' is it makes something like:

struct Foo {
int : 31; // Take up some space so something else could have been put there.
int field : 1; 
int field2 : 2;
};

and comparing sizeof the type with the output to be... jarring at least.

you can tell the difference between a bitfield and a regular member by the length of the initialiser list;

I see now how you've done that with ctors in C++. It feels that the way of doing this in C is unfortunately quite clunky (in addition, not being able to use arbitrary types).

2500

Well, if the callback itself took the discriminated union for each thing, and we did 1 callback per base/field/etc, perhaps that would be useful? I just am having a hard time seeing this being all that useful in C, which is a shame.

2509

I think this is again falling down the "this is close enough to reflection, I want reflection" hole :) I find myself wanting to be able to introspect member functions, static variables/functions/etc.

2511

I guess it depends on the use case for these builtins. I saw the 'dump-struct' builtin's purpose was for quick & dirty printf-debugging. At which point the format/features were less important than the action. So from that perspective, I would see this having 'special access rights' as sort of a necessity.

Though again, this is a significantly more general builtin, which comes with additional potential use cases.

clang/lib/Sema/SemaChecking.cpp
434

I thought it did? For the use case I see __builtin_dump_struct having, I would think printing the existence of unnamed bitfields to be quite useful.

clang/test/CodeGenCXX/builtin-reflect-struct.cpp
24 ↗(On Diff #424379)

Yep, thanks for that. I see now how it can be used that way.

clang/test/SemaCXX/builtin-reflect-struct.cpp
52 ↗(On Diff #424379)

Ah, this is perhaps the example I was looking for.

82 ↗(On Diff #424379)

What would this look like if you needed to look into a union? Perhaps it would be handled similarly to "__is_class"? In both builtins, its a shame we don't have a way to properly handle the 'active member'.

rsmith added inline comments.Apr 22 2022, 10:07 AM
clang/docs/LanguageExtensions.rst
2472

Suppose you want to dump a struct to a log file, or you want the dump in a string. Or you want to carry along an indentation level. My first attempt at using __builtin_dump_struct essentially failed because it doesn't have anything like this; even if we don't want this patch, this is functionality we should add to that builtin.

2500

I think that would work. I'm not sure it would be better, though, given that it loses the ability to build a type whose shape depends on the number of bases and fields, because you're not given them all at once. (Eg, if you want to build a static array of field names.)

2511

The use cases are the things that __builtin_dump_struct gets close to but fails at, such as: I have an EXPECT_EQ macro that I'd like to log the difference in input and output values where possible, or I want to build a simple print-to-string utility for structs. Or I want a dump to stout but I want it formatted differently from what __builtin_dump_struct thinks is best.

The fact that the dump built-in gets us into this area but can't solve these problems is frustrating.

clang/lib/Sema/SemaChecking.cpp
434

Why? They're not part of the value, they're just padding.

rsmith added inline comments.Apr 22 2022, 10:14 AM
clang/lib/Sema/SemaChecking.cpp
434

Looks like __builtin_dump_struct currently includes them *and* prints a value (whatever garbage happens to be in those padding bits). That's obviously a bug.

erichkeane added inline comments.Apr 22 2022, 10:17 AM
clang/docs/LanguageExtensions.rst
2472

Ah, I see! Sure, I can see how those are useful, though again, mostly to the C implementation (as those should all be part of the functor's state), and this doesn't work particularly well in C.

2500

Yep, thats definitely true. There are ways to DO that (either a 2-pass at this to get the sizes, allocate something, then go through again filling in your struct OR a constant 'append' thing with a vector like type), but I see what you mean. It seems that either case we are limiting a couple of useful use cases.

2511

The use cases are the things that __builtin_dump_struct gets close to but fails at, such as: I have an EXPECT_EQ macro that I'd like to log the difference in input and output values where possible

THAT is an interesting use case, I see value to that too.

The fact that the dump built-in gets us into this area but can't solve these problems is frustrating.

I definitely see that. FWIW, I don't think I was the one to approve/review the initial version of __builtin_dump_struct (and perhaps would have been equally frustrated if I thought that through), but was looking into the patches after the fact to help out.

I guess I find myself equally/additionally/etc frustrated at THIS patch because it gets us so close to everything SG7 promises they can invent.

clang/lib/Sema/SemaChecking.cpp
434

They are lexically part of the type and are part of what makes up the 'size' of the thing. I would expect something dumping the type to be as consistent lexically as possible.

434

O.o! Yeah, printing the value is nonsense.

Thank you for looking into this! I've also thought that the __builtin_dump_struct implementation hasn't been quite as robust as it could be. However, the primary use case for that builtin has been for kernel folks to debug when they have no access to an actual debugger -- it being a super limited interface is actually a feature rather than a bug, in some ways.

I think what you're designing here is straight-up reflection, which is a different use case. If we're going to get something that's basically only usable in C++, I'm not certain there's much value in adding builtins for reflection until WG21 decides what reflection looks like, and then we can design around that. (Personally, I think designing usable reflection interfaces for C would be considerably harder but would provide considerably more benefit to users given the lack of reflection capabilities. There's almost no chance WG14 and WG21 would come up with the same interfaces for reflection, so I think we've got some opportunity for exploration here.)

Thank you for looking into this! I've also thought that the __builtin_dump_struct implementation hasn't been quite as robust as it could be. However, the primary use case for that builtin has been for kernel folks to debug when they have no access to an actual debugger -- it being a super limited interface is actually a feature rather than a bug, in some ways.

I'm more concerned about the design of __builtin_dump_struct than the implementation -- we can always fix the implementation. Right now the design is hostile to any use other than the very simplest one where the function you give it is essentially printf. You can't even dump to a log file or to a string without having to fight the design of the builtin. And if you want anything other than the exact recursion and formatting choices that we've arbitrarily made, you're out of luck. And of course, if you're in C++, you'll want to be able to print out std::string and std::optional<int> and similar types too, which are not supported by the current design at all. Perhaps this is precisely what a specific set of kernel folks want for a specific use case, but if the only purpose is to address that one use case, then I don't really see how we can justify keeping this builtin in-tree. So I think we should either expand the scope beyond the specific kernel folks or we should remove it.

I think what you're designing here is straight-up reflection, which is a different use case. If we're going to get something that's basically only usable in C++, I'm not certain there's much value in adding builtins for reflection until WG21 decides what reflection looks like, and then we can design around that. (Personally, I think designing usable reflection interfaces for C would be considerably harder but would provide considerably more benefit to users given the lack of reflection capabilities. There's almost no chance WG14 and WG21 would come up with the same interfaces for reflection, so I think we've got some opportunity for exploration here.)

What I set out to build is something that lets you implement struct dumping without all the shortcomings I see in __builtin_dump_struct. I think it's inevitable that that ends up being substantially a struct reflection system, and indeed __builtin_dump_struct is a reflection system, too, just a really awkward one -- people have already started parsing its format strings to extract struct field names, for example. (The usage I've seen actually *is* struct dumping, but that usage can't use __builtin_dump_struct directly because of its limitations, so __builtin_dump_struct is just used to extract the field names, and the field values and types are extracted via structured bindings instead.)

I do agree that we shouldn't be providing a full reflection mechanism here, given both that one is coming anyway, and that whatever we design, people will inevitably ask for more, and we don't want to be maintaining our own reflection technology.

So, I think we should either roll back __builtin_dump_struct or fix forward. This patch attempted to do the latter, but maybe it's gone too far in the direction of reflection. I think we can address most of my concerns with __builtin_dump_struct without a new builtin, by incorporating things from this implementation as follows:

  • Desugar it in Sema rather than in CodeGen -- this is necessary to enable following things as well as for constant evaluation
  • Take any callable as the printing function and pass it the fields with their correct types, so that a C++ implementation can dispatch based on the type and print the values of types that we don't hard-code (eg, we'd generate calls like user_dump_function("%s = %p", "field_name", &p->field_name);, and a smart C++ formatting function can preserve the field type and do something suitable when formatting a corresponding %p).
  • Take additional arguments to be passed onto the provided function, so that C users can easily format to a different file or to a string or whatever else

The main concern that I think we can't address this way is that __builtin_dump_struct decides for itself how to format the struct. But if we keep that property, that might actually help to keep us in the bounds of a builtin that's good for dumping but nothing else?

yihanaa added inline comments.Apr 23 2022, 8:33 AM
clang/lib/Sema/SemaChecking.cpp
434

I think the user should be informed somehow that there is an unnamed bitfield here

Thank you for looking into this! I've also thought that the __builtin_dump_struct implementation hasn't been quite as robust as it could be. However, the primary use case for that builtin has been for kernel folks to debug when they have no access to an actual debugger -- it being a super limited interface is actually a feature rather than a bug, in some ways.

I'm more concerned about the design of __builtin_dump_struct than the implementation -- we can always fix the implementation. Right now the design is hostile to any use other than the very simplest one where the function you give it is essentially printf. You can't even dump to a log file or to a string without having to fight the design of the builtin. And if you want anything other than the exact recursion and formatting choices that we've arbitrarily made, you're out of luck. And of course, if you're in C++, you'll want to be able to print out std::string and std::optional<int> and similar types too, which are not supported by the current design at all. Perhaps this is precisely what a specific set of kernel folks want for a specific use case, but if the only purpose is to address that one use case, then I don't really see how we can justify keeping this builtin in-tree. So I think we should either expand the scope beyond the specific kernel folks or we should remove it.

Yes, I agree that the current builtin is limited and not particularly well designed for general use. I would be fine removing support for it. I'd be happier augmenting it so it works for more use cases. However, I don't think what you've got here so far "expands the scope" so much as "moves the goalposts" -- if the replacement can't be used by the kernel folks, then we're losing the primary use case we have for that builtin. However, it sounds like you may have some ideas on that so we can make it more usable in C.

I think what you're designing here is straight-up reflection, which is a different use case. If we're going to get something that's basically only usable in C++, I'm not certain there's much value in adding builtins for reflection until WG21 decides what reflection looks like, and then we can design around that. (Personally, I think designing usable reflection interfaces for C would be considerably harder but would provide considerably more benefit to users given the lack of reflection capabilities. There's almost no chance WG14 and WG21 would come up with the same interfaces for reflection, so I think we've got some opportunity for exploration here.)

What I set out to build is something that lets you implement struct dumping without all the shortcomings I see in __builtin_dump_struct. I think it's inevitable that that ends up being substantially a struct reflection system, and indeed __builtin_dump_struct is a reflection system, too, just a really awkward one -- people have already started parsing its format strings to extract struct field names, for example. (The usage I've seen actually *is* struct dumping, but that usage can't use __builtin_dump_struct directly because of its limitations, so __builtin_dump_struct is just used to extract the field names, and the field values and types are extracted via structured bindings instead.)

Agreed that the current builtin is effectively (really bad) reflection as well, and I'm sad to hear people are trying to parse its output to get structure field names (we have a JSON AST dump for exactly that kind of situation, so I'm not super sympathetic to people using builtin dump struct in that way unless they also need the runtime value information of the fields).

I do agree that we shouldn't be providing a full reflection mechanism here, given both that one is coming anyway, and that whatever we design, people will inevitably ask for more, and we don't want to be maintaining our own reflection technology.

Agreed.

So, I think we should either roll back __builtin_dump_struct or fix forward. This patch attempted to do the latter, but maybe it's gone too far in the direction of reflection. I think we can address most of my concerns with __builtin_dump_struct without a new builtin, by incorporating things from this implementation as follows:

  • Desugar it in Sema rather than in CodeGen -- this is necessary to enable following things as well as for constant evaluation

+1

  • Take any callable as the printing function and pass it the fields with their correct types, so that a C++ implementation can dispatch based on the type and print the values of types that we don't hard-code (eg, we'd generate calls like user_dump_function("%s = %p", "field_name", &p->field_name);, and a smart C++ formatting function can preserve the field type and do something suitable when formatting a corresponding %p).

+1, though I'm curious what we want to do about things like bit-fields (do we want to alert the caller that it's a bit-field and what the width is?), anonymous bit-fields, base classes vs fields of class type, etc.

  • Take additional arguments to be passed onto the provided function, so that C users can easily format to a different file or to a string or whatever else

+1, that seems pretty sensible for what is effectively a callback function.

The main concern that I think we can't address this way is that __builtin_dump_struct decides for itself how to format the struct. But if we keep that property, that might actually help to keep us in the bounds of a builtin that's good for dumping but nothing else?

I think that could be workable. Given that it was intended as a debugging interface, I don't think the format of the struct is critical so long as the default format isn't too awful (for some definition of "too awful") and we provide at least the same amount of information as before.

Thank you for looking into this! I've also thought that the __builtin_dump_struct implementation hasn't been quite as robust as it could be. However, the primary use case for that builtin has been for kernel folks to debug when they have no access to an actual debugger -- it being a super limited interface is actually a feature rather than a bug, in some ways.

I'm more concerned about the design of __builtin_dump_struct than the implementation -- we can always fix the implementation. Right now the design is hostile to any use other than the very simplest one where the function you give it is essentially printf. You can't even dump to a log file or to a string without having to fight the design of the builtin. And if you want anything other than the exact recursion and formatting choices that we've arbitrarily made, you're out of luck. And of course, if you're in C++, you'll want to be able to print out std::string and std::optional<int> and similar types too, which are not supported by the current design at all. Perhaps this is precisely what a specific set of kernel folks want for a specific use case, but if the only purpose is to address that one use case, then I don't really see how we can justify keeping this builtin in-tree. So I think we should either expand the scope beyond the specific kernel folks or we should remove it.

Yes, I agree that the current builtin is limited and not particularly well designed for general use. I would be fine removing support for it. I'd be happier augmenting it so it works for more use cases. However, I don't think what you've got here so far "expands the scope" so much as "moves the goalposts" -- if the replacement can't be used by the kernel folks, then we're losing the primary use case we have for that builtin. However, it sounds like you may have some ideas on that so we can make it more usable in C.

In general I've not been a fan of __builtin_dump_struct (I wasn't paying attention when it got merged), but I was assured in the meantime that it was JUST for printf-debugging in places without a debugger. I found its inflexibility, changing format, and inconsistent functionality to be a 'feature' at that point, as it would discourage non-printf-debugging use cases. BUT I guess you're seeing Hyrum's law in action :D It would be a shame to LOSE the builtin, since it is apparently REALLY useful in the debugging case, but I definitely don't think I would have +1'ed it if I were to see/understand it as a new addition.

I think what you're designing here is straight-up reflection, which is a different use case. If we're going to get something that's basically only usable in C++, I'm not certain there's much value in adding builtins for reflection until WG21 decides what reflection looks like, and then we can design around that. (Personally, I think designing usable reflection interfaces for C would be considerably harder but would provide considerably more benefit to users given the lack of reflection capabilities. There's almost no chance WG14 and WG21 would come up with the same interfaces for reflection, so I think we've got some opportunity for exploration here.)

What I set out to build is something that lets you implement struct dumping without all the shortcomings I see in __builtin_dump_struct. I think it's inevitable that that ends up being substantially a struct reflection system, and indeed __builtin_dump_struct is a reflection system, too, just a really awkward one -- people have already started parsing its format strings to extract struct field names, for example. (The usage I've seen actually *is* struct dumping, but that usage can't use __builtin_dump_struct directly because of its limitations, so __builtin_dump_struct is just used to extract the field names, and the field values and types are extracted via structured bindings instead.)

Agreed that the current builtin is effectively (really bad) reflection as well, and I'm sad to hear people are trying to parse its output to get structure field names (we have a JSON AST dump for exactly that kind of situation, so I'm not super sympathetic to people using builtin dump struct in that way unless they also need the runtime value information of the fields).

I agree completely, over the weekend I also came to the conclusion that any attempt to do ANYTHING in this space is likely going to be used as a poor-man's reflection. At that point, I consider the awkwardness to be a bit of a 'feature' here, as I'd hope anyone trying to use it for its not-intended-purpose would at one point pause thinking whether they COULD, and think briefly whether they SHOULD.

I do agree that we shouldn't be providing a full reflection mechanism here, given both that one is coming anyway, and that whatever we design, people will inevitably ask for more, and we don't want to be maintaining our own reflection technology.

Agreed.

I VERY much don't want to be spending cycles maintaining our own reflection, part of why I am/was so concerned with __builtin_reflect_struct. Though, I'm coming to terms with the fact that __builtin_dump_struct _IS_ 'crappy reflection' already.

So, I think we should either roll back __builtin_dump_struct or fix forward. This patch attempted to do the latter, but maybe it's gone too far in the direction of reflection. I think we can address most of my concerns with __builtin_dump_struct without a new builtin, by incorporating things from this implementation as follows:

  • Desugar it in Sema rather than in CodeGen -- this is necessary to enable following things as well as for constant evaluation

+1

I agree here, I found the implemented-in-codegen bizarre and limiting, so even this would be an improvement without the others.

  • Take any callable as the printing function and pass it the fields with their correct types, so that a C++ implementation can dispatch based on the type and print the values of types that we don't hard-code (eg, we'd generate calls like user_dump_function("%s = %p", "field_name", &p->field_name);, and a smart C++ formatting function can preserve the field type and do something suitable when formatting a corresponding %p).

+1, though I'm curious what we want to do about things like bit-fields (do we want to alert the caller that it's a bit-field and what the width is?), anonymous bit-fields, base classes vs fields of class type, etc.

I suspect we'd want some sort of 'depth' parameter to that as well so we can handle the 'tabbing' correctly, and am also concerned with the things Aaron brought up (particularly the anonymous bit-fields/zero length bitfields), as they seem to be things that make current users persnickety.

  • Take additional arguments to be passed onto the provided function, so that C users can easily format to a different file or to a string or whatever else

+1, that seems pretty sensible for what is effectively a callback function.

I am somewhat concerned with these as they get confusing/obtuse to use sometimes (particularly with variadic functions as the callback, where they cause strange non-compile-time errors with minor, seemingly innocuous changes), but am reasonably sure we could come up with some level of design that would make it not-awful. If this was JUST C++, I'd say "we should not, and the user should use a functor/lambda for state", but, of course, this is C :/

The main concern that I think we can't address this way is that __builtin_dump_struct decides for itself how to format the struct. But if we keep that property, that might actually help to keep us in the bounds of a builtin that's good for dumping but nothing else?

I think that could be workable. Given that it was intended as a debugging interface, I don't think the format of the struct is critical so long as the default format isn't too awful (for some definition of "too awful") and we provide at least the same amount of information as before.

Yep, agreed. The guarantee that we give is "human readable", anything we do beyond that encourages folks trying to see if it is "computer readable", which is immediately "bad reflection". The BOFH-compiler-guy part of me (B-Compiler-Dev-FH ?) thinks we should start randomizing the output to discourage this behavior, but I suspect that would be an arms-race we would ultimately lose.

yihanaa added a comment.EditedApr 25 2022, 9:04 AM

Thank you for inviting me to a discussion on this topic. I have something very confusing as follows:

  • If the developer already knows all fields info in some struct, wyh doesn't he write a simple function to do the printing(like: print_struct_foo(const struct foo *); )? It might be easier to write a printing function.
  • If the struct has many levels of nesting, and the developers just want to output the details of the struct for debugging purposes, then I think they might just need something like the 'p' command in LLDB.
  • If we want to support these printing functions, and make this builtin flexible and general, I think reflection is a good idea. But we may want to make this builtin support C/C++ or other languages supported by Clang at the same time. I have a not-so-sophisticated (perhaps silly-sounding) idea: this builtin might be one or more C-style functions in order to be usable in many different languages. This family of functions supports getting the number of struct fields, as well as their details (and possibly getting the value of the field by getting the subscript). like:
int __builtin_reflect_struct_member_count(const void *struct_addr, unsigned *count);
int __builtin_reflect_struct_member_detail(const void *struct_addr, unsigned index, char *name, unsigned *bitfield, int *is_unnamed);
int __builtin_reflect_struct_member_ptr(const void *struct_addr, unsigned index, void **ptr);

Maybe it also can support nested struct?

emmmmm, my idea looks really stupid, don't laugh at me, hahahaha~~~😂😂😂😂
I'm really looking forward to hearing your idea💡

  • Take any callable as the printing function and pass it the fields with their correct types, so that a C++ implementation can dispatch based on the type and print the values of types that we don't hard-code (eg, we'd generate calls like user_dump_function("%s = %p", "field_name", &p->field_name);, and a smart C++ formatting function can preserve the field type and do something suitable when formatting a corresponding %p).

+1, though I'm curious what we want to do about things like bit-fields (do we want to alert the caller that it's a bit-field and what the width is?), anonymous bit-fields, base classes vs fields of class type, etc.

I suspect we'd want some sort of 'depth' parameter to that as well so we can handle the 'tabbing' correctly,

I was thinking that we'd pass hard-coded " " strings to the user_dump_function for indentation (like we do now), rather than trying to make it customizable, so that, in particular, you can still pass in printf as the user_dump_function and have it "just work".

and am also concerned with the things Aaron brought up (particularly the anonymous bit-fields/zero length bitfields), as they seem to be things that make current users persnickety.

Yeah. User concerns about the output format not being what they wanted really reinforce for me the fundamental problems with this builtin. But we can still make reasonable and principled choices here (eg, show the same information that would be in a corresponding compound literal).

  • Take additional arguments to be passed onto the provided function, so that C users can easily format to a different file or to a string or whatever else

+1, that seems pretty sensible for what is effectively a callback function.

I am somewhat concerned with these as they get confusing/obtuse to use sometimes (particularly with variadic functions as the callback, where they cause strange non-compile-time errors with minor, seemingly innocuous changes), but am reasonably sure we could come up with some level of design that would make it not-awful. If this was JUST C++, I'd say "we should not, and the user should use a functor/lambda for state", but, of course, this is C :/

I definitely understand that concern. I think it's important to allow at least one parameter -- I'd be OK with saying that you can pass either the signature of printf or that signature with a leading void*, but I think it's nicer to say we'll take whatever extra arguments are given and blindly pass them onto the given function, so that we can directly support fprintf and dprintf. (You'd still need a wrapper in order to be able to call snprintf, though, in order to move the pointer forward and correspondingly decrease the remaining buffer length.) But as long as we provide *some* way to pass arbitrary information to the callback, I think that satisfies the use case.

The BOFH-compiler-guy part of me (B-Compiler-Dev-FH ?) thinks we should start randomizing the output to discourage this behavior, but I suspect that would be an arms-race we would ultimately lose.

:-) I think the best thing we can do to discourage people from using this dumping facility for reflection is to provide builtins to implement the C++ reflection functionality as soon as we can after we think it's ready -- and ideally to make sure that those builtins support at least basic use cases in C. (The latter might require that we wait until C's constant evaluation functionality improves, but at least some in WG14 are pushing hard for that.)

rsmith added inline comments.Apr 25 2022, 1:23 PM
clang/lib/Sema/SemaChecking.cpp
434

It seems to me that __builtin_dump_struct is displaying the value of a struct object, not the representation of the object, and unnamed bitfields are not a part of the value, so should not be included. And I think it makes sense for that builtin to be value-oriented: someone using it presumably already knows whatever they want to know about the representation, to the extent that it matters, or can look it up; what they don't know is the information that varies from one instance to another. If we want a builtin that's more oriented around showing the struct's memory representation, I think we'd want quite a different output format, including offsets for fields -- and even then I don't think it makes sense to include unnamed bit-fields because they're not different from any other kind of padding in the object. If we still want to print unnamed bit-fields for some reason, presumably we should also print alignment and packing attributes and pragmas, because they too can change the padding within a struct.

aaron.ballman added inline comments.Apr 27 2022, 12:22 PM
clang/lib/Sema/SemaChecking.cpp
434

I tend to agree that __builtin_dump_struct should be focused on the values in the structure and not the representation.

However, keeping in mind that the original builtin was for kernel folks and the kernel tends to do some pretty interesting things (that aren't always supported by the standard), there's a part of me that wonders if they have a use case for unnamed bit-fields where they expect data to be squirreled away in those bits (through memcpy() or some other means) that they'd like to be able to read.

rsmith updated this revision to Diff 426345.May 2 2022, 12:46 AM
  • Reimplement __builtin_dump_struct in Sema.
rsmith retitled this revision from Add new builtin __builtin_reflect_struct. to Reimplement `__builtin_dump_struct` in Sema..May 2 2022, 12:47 AM
rsmith edited the summary of this revision. (Show Details)

I think we can address most of my concerns with __builtin_dump_struct without a new builtin, by incorporating things from this implementation

Done; this patch now reimplements __builtin_dump_struct using the Sema desugaring approach rather than providing a new builtin. As in the prior patch, we still support passing an arbitrary callable and additional arguments, and in C++ we support constant-evaluation of the dump as well as generic printing of field values even for types that the builtin doesn't natively understand. But the interface remains simple enough that you can just call __builtin_dump_struct(&s, printf) or __builtin_dump_struct(&s, fprintf, stderr) and get a dump to stdout / stderr.

rsmith updated this revision to Diff 426347.May 2 2022, 12:53 AM
rsmith edited the summary of this revision. (Show Details)
  • Fix example in documentation: the dump is terminated by a newline.
rsmith updated this revision to Diff 426348.May 2 2022, 12:54 AM
  • Minor doc correction
rsmith updated this revision to Diff 426349.May 2 2022, 1:01 AM
  • Update release notes
yihanaa added inline comments.May 2 2022, 1:19 AM
clang/lib/Sema/SemaChecking.cpp
441

What do you think of PrintingPolicy.Indentation here?

447

I think this is better maintained in "clang/AST/FormatString.h". For example analyze_printf::PrintfSpecifier can get format specifier, what do you all think about?

yihanaa added inline comments.May 2 2022, 3:03 AM
clang/docs/ReleaseNotes.rst
174–182

I tried it, D124221 will still generate anonymous tag locations, this place needs to be corrected, or removed anonymous tag locations.

int main() {

struct U20A {
  _Bool a;
  unsigned : 32;
  struct {
    int x;
    int y;
  } c;
};

struct U20A a = {
    .a = 1,
    .c.x = 1998
};
  __builtin_dump_struct(&a, printf);

}

struct U20A {

_Bool a = 1
struct (unnamed struct at ./builtin_dump_struct.c:10:5) c = {
    int x = 1998
    int y = 0
}

}

Thanks for working on this, I like the direction it's going!

clang/include/clang/Basic/DiagnosticSemaKinds.td
8470–8473

Should we combine these into:

def err_builtin_dump_struct_expects : Error<
  "expected %select{pointer to struct|a callable expression}0 as %ordinal1 argument to %2, found %3">;

to reduce the number of diagnostics we have to tablegen?

clang/lib/AST/Expr.cpp
2742

I don't think the changes here are incorrect, but I'm wondering if they should be split into a different patch as they seem to be largely unrelated to the builtin?

clang/lib/Sema/SemaChecking.cpp
404

Haha, neat trick. Perhaps someday we should allow getPredefinedStringLiteralFromCache() to specify a SourceLocation?

411

Can you add an assertion that the call has at least two args?

447

+1 to this suggestion -- my hope is that we could generalize it more then as I notice there are missing specifiers for things like intmax_t, size_t, ptrdiff_t, _Decimal types, etc. Plus, that will hopefully make it easier for __builtin_dump_struct to benefit when new format specifiers are added, such as ones for printing a _BitInt.

514

Can you spell out the type since it's not in the initialization?

521
586–588

Should we handle array types specifically, or should we continue to punt on those for now?

639–640

ObjC block?

This is a pretty nice seeming interface that I think does a fairly good job at maintaining backwards compat while improving the functionality. A few questions/comments.

clang/docs/LanguageExtensions.rst
2442

I don't know if anyone would be using this value, but I wonder if there is value to making this a 'sum' of the results of f? Less useful for the purposes of printf, but perhaps to a sprintf-type-function?

Not sure if we have the use cases to support this however.

clang/lib/Sema/SemaChecking.cpp
572
609

What is happening here? I would expect the opposite to need to happen here (wrapper be set to match the call, not the other way around?).

617

This is unfortunate. Do we really not have a checkAtLeastArgCount type function here? I know we have something similar for attributes.

628

Is this sufficient? Couldn't this be a pointer to a union? Do you mean the suggestion?

This implementation looks good to me😁

yihanaa added inline comments.May 2 2022, 11:08 AM
clang/lib/Sema/SemaChecking.cpp
587

Can we generate a warning diagnostic message when we see an unsupported type (perhaps temporarily not supported), what do you all think about?

Do we need to do argument promotion here,ahh...maybe i understood wrong

rsmith updated this revision to Diff 426517.May 2 2022, 1:58 PM
rsmith marked 5 inline comments as done.
  • Respond to review comments.
rsmith updated this revision to Diff 426520.May 2 2022, 2:27 PM
  • Use printing policy more, and turn off anonymous tag locations.
rsmith updated this revision to Diff 426525.May 2 2022, 2:45 PM
rsmith marked 3 inline comments as done.
  • Add requested assert.
clang/docs/LanguageExtensions.rst
2442

I don't think you can use this directly with sprintf -- you'd need a wrapper to update the write position as part of each call to f, and that wrapper can handle accumulating the value returned from sprintf. In order to do that, you'd want to provide an additional input argument that can be modified as the dump progresses, and that can be used to return the overall (desired) size too. So I don't think even the sprintf case provides motivation for this, and it would add complexity (as well as making the builtin less usable in the case where the printing function doesn't return something addable).

I could see this being at least a little useful for a formatting function that returns std::string or similar, but ... I think it's probably better to let the formatting function deal with that itself through argument passing instead of assuming that we can apply + to the results. WDYT?

clang/docs/ReleaseNotes.rst
174–182

Thanks, fixed.

clang/include/clang/Basic/DiagnosticSemaKinds.td
8470–8473

I'm indifferent. That kind of refactoring will probably decrease the chance of these being reused for any other purpose, which might end up costing us more diagnostics, and will need a magic number in the source code to select between alternatives. Happy to do it if you think it's worthwhile, but I don't have any measurements for how much each diagnostic ID costs.

clang/lib/AST/Expr.cpp
2742

Yeah, will do. I think these changes may actually turn out to not be necessary for __builtin_dump_struct, because it returns void (they were necessary for __builtin_reflect_struct`).

clang/lib/Sema/SemaChecking.cpp
404

The desire for caching and the desire for a source location are a bit at odds here, unfortunately. Also unfortunate: if you actually put a source location on a StringLiteral, other parts of Clang (especially relevant for this patch: format string checking!) will assume that there's a string literal token that they can re-lex from that location, and will get very unhappy if there isn't. Perhaps a no-op LocatedExpr that just wraps an expression and gives it a different location would work, but of course we'd need to teach various things to look through it.

441

It looks like nothing sets that outside of libclang, so I don't think that'd be especially useful, but it seems fine to me to use it here. We're not exactly pretty-printing source code, but that doesn't seem like a problem. Done.

447

I am somewhat uncertain: every one of these is making arbitrary choices about how to format the value, so it's not clear to me that this is general logic rather than something specific to __builtin_dump_struct. For example, using %f rather than one of the myriad other ways of formatting double is somewhat arbitrary. Using %s for any const char* is *extremely* arbitrary and will be wrong and will cause crashes in some cases, but it may be the pragmatically correct choice for a dumping utility. A general-purpose mechanism would use %p for all kinds of pointer.

We could perhaps factor out the formatting for cases where there is a clear canonical default formatting, such as for integer types and probably %p for pointers, then call that from here with some special-casing, but without a second consumer for that functionality it's really not clear to me what form it should take.

586–588

I'd prefer to punt for now, if that's OK. I don't think this is hard to add, but it's probably clearer to do it in a separate patch. There are also some nontrivial questions there, eg: should we dump all elements or stop after some limit? Should we dump trailing zeroes in an integer array? How should we dump arrays of char? I'd prefer to not deal with those questions in this patch :)

587

We could, but I'm leaning towards thinking that we shouldn't warn by default at least. That would get in the way of the C++ use case where the type actually is supported by the printing function; I'm not sure that we can easily tell those cases apart. And it's probably not all that useful to someone who just wants whatever information we can give them while debugging. I'd be OK with adding a warning that's off by default, though. What would be the intended use case here? Would an off-by-default warning satisfy it?

609

This is the normal way we handle builtins with custom type checking: the SemaChecking logic analyzes the builtin and sets the result type of the call as appropriate, because the type we get from Builtins.def is meaningless. As it happens, this is a degenerate case (compared to __builtin_reflect_struct where this did something more interesting), because the type of the PseudoObjectExpr is always void here; I can just set that directly if you think that's clearer. It shouldn't matter much anyway, given that the call is only used as the syntactic form of the PseudoObjectExpr, so its type doesn't really affect anything. but following the type and value kind of the PseudoObjectExpr seems "right" to me.

617

We do now. :)

628

Allowing this is what the existing builtin did, and there were tests for it.

I'm not sure what the best thing to do with unions is -- especially when a union appears as a struct member. Perhaps the least bad approach is to print out all the union members and hope that nothing goes wrong. That'll crash in some cases (eg, union { const char *p; int n; }; when the int is the active member), but that seems to be the nature of this builtin.

I've added a FIXME to dumpRecordValue.

aaron.ballman added inline comments.May 3 2022, 6:00 AM
clang/include/clang/Basic/DiagnosticSemaKinds.td
8470–8473

Eh, let's skip it for now. I bring it up on occasion because we do run out of diagnostic IDs on occasion, but that's more a systemic concern.

clang/lib/Sema/SemaChecking.cpp
586–588

Absolutely fine to punt on it, because I agree there's a bunch of questions to answer there. I was also thinking about what to do with flexible array members or other trailing array objects.

587

So long as we don't trigger UB, I think it's fine to not diagnose for unsupported types. But the UB cases do strike me as something we might want to warn about if we think they can be hit (specifically, these kinds of UB: https://eel.is/c++draft/cstdarg.syn#1).

As an example, do we properly handle dumping this case:

struct S {
  int &i;
  S(int &ref) : i(ref) {}
};

where the vararg argument is of reference type? I suppose another interesting question is:

struct S {
  int i;
};

void func() {
  S s;
  [=]() {
    __builtin_dump_struct(&s, ::printf);
  };
}

(is S.i "an entity resulting from a lambda capture"?)

erichkeane added inline comments.May 3 2022, 6:06 AM
clang/docs/LanguageExtensions.rst
2442

I agree that we're probably better off just making the user write a wrapper function/functor to do so. I was thinking about C folks, but you're right, they could pass a state object into their extended args.

I think we probably should just hold off on any attempts to do this until we find someone with a bonafide need.

clang/lib/Sema/SemaChecking.cpp
609

Ah, I see some other examples now, thanks for the explanation. I don't think setting to void is a good idea, in case we find a good reason to change the type; this ends up being just one more place we need to mess with it.

628

SGTM! We DO document a few times IIRC that we accept a 'struct' param(in fact, the comment on line 637 n ow says we do!), so I wanted to make sure we are consistent there.

rsmith updated this revision to Diff 426819.May 3 2022, 1:30 PM
rsmith marked 9 inline comments as done.
  • Rebase.
  • Reuse Clang's existing "guess a format specifier" logic.
  • Format signed/unsigned/plain char as numbers.
  • Add test for reference members.
rsmith added inline comments.May 3 2022, 1:33 PM
clang/lib/Sema/SemaChecking.cpp
447

I went ahead and did this, mostly to match concurrent changes to the old implementation. There are a few cases where our existing "guess a format specifier" logic does the wrong thing for dumping purposes, which I've explicitly handled -- things like wanting to dump a char / signed char / unsigned char member as a number rather than as a (potentially non-printable or whitespace) character.

587

The only things we pass to the formatting function are:

  • The small number of builtin types for which we have known formatting specifiers.
  • Pointers to void, using %p.
  • Pointers to [const] char, using %s.
  • Pointers to user-defined types, using *%p.

All of these can be passed through a ... without problems.

We can potentially introduce the following sources of UB:

  • Indirection through a struct pointer that doesn't point to a struct object.
  • Read of an inactive union member.
  • Read of an uninitialized member.
  • Printing something other than a pointer to a nul-terminated byte sequence with %s.

I'm not sure how much we should be worrying about those. It'd be nice to avoid them, though I don't think we can fix the fourth issue without giving up the %s formatting entirely.

Reference types aren't on our list of types with known formatting specifiers, so in your first example we'd call printf("%s%s = *%p", "int &", "i", &p->i), producing output like int & i = *0x1234abcd. We could special-case that to also dereference the reference and include the value (int & i = *0x1234abcd (5), but for now I'm treating it like any other type for which we don't have a special formatting rule. I've added test coverage for this case.

In your second example, we rewrite into ::printf("%s%s = %s", "int", "i", (&s)->i), which is fine. The parmN restriction is about calls to va_start:

void f(int a, ...) {
  [&] {
    va_list va;
    va_start(va, a); // error, `a` is captured here
    // ...
  };
}
yihanaa added inline comments.May 4 2022, 4:32 AM
clang/lib/Sema/SemaChecking.cpp
447

When I was patching that old implementation, I found that for uint8_t, int8_t, Clang's existing "guess a format specifier" logic would treat the value as an integer, but for unsigned char, signed char, char types, it would Treat it as a character, please look at this example ( https://godbolt.org/z/ooqn4468T ), I guess this existing logic may have made some special judgment.

rsmith added inline comments.May 4 2022, 11:10 AM
clang/lib/Sema/SemaChecking.cpp
447

Yeah. I think in the case where we see some random call to printf, %c is probably the right thing to guess here, but it doesn't seem appropriate to me to use this in a dumping routine. If we could dump as 'x' for printable characters and as '\xAB' for everything else, that'd be great, but printf can't do that itself and I'm not sure we want to be injecting calls to isprint or whatever to make that work. Dumping as an integer always seems like it's probably the least-bad option.

Somewhat related: I wonder if we should use "\"%s\"" instead of simply "%s" when dumping a const char*. That's not ideal but probably clearer than the current dump output.

erichkeane added inline comments.May 4 2022, 11:11 AM
clang/lib/Sema/SemaChecking.cpp
447

I see value to having strings with SOME level of delimiter, if at least to handle cases when the string itself has a newline in it.

yihanaa added inline comments.May 4 2022, 10:02 PM
clang/lib/Sema/SemaChecking.cpp
447

This looks good, but the string may need to be escaped (if there are some special characters in the original string)

yihanaa added inline comments.May 4 2022, 10:25 PM
clang/lib/Sema/SemaChecking.cpp
447

Maybe users should wrap the printf function, and then pass the wrapper function as a parameter, so that they can do some custom things

aaron.ballman accepted this revision.May 5 2022, 4:44 AM

Aside from a minor nit, I think this LGTM. Thank you for working on this!

clang/lib/Sema/SemaChecking.cpp
447

Somewhat related: I wonder if we should use "\"%s\"" instead of simply "%s" when dumping a const char*. That's not ideal but probably clearer than the current dump output.

I'd be in favor of that. (I sort of wonder whether we do want to inject calls to isprint() though -- it seems like escaping nonprintable characters would be the kindest option for both characters and strings.)

476

Note for future reference: I don't think fixType() does anything special to help us out with arrays.

507

My suggestion is to print the value of all fields and treat any string-like types as %p rather than %s. I'm fine with this being done later.

542
587

Thank you for the explanation, it sounds like we're doing about as good as we can get here in terms of safety.

FWIW, I like the idea of outputting the value for a reference type, but am fine punting on that for now (it could get awkward to format when the referenced type is a structure, for example).

This revision is now accepted and ready to land.May 5 2022, 4:44 AM
yihanaa accepted this revision.May 5 2022, 5:25 AM
erichkeane accepted this revision.May 5 2022, 6:29 AM

FWIW, I'm in favor of the patch as it sits.

As a followup: So I was thinking about the "%s" specifier for string types. Assuming char-ptr types are all strings is a LITTLE dangerous, but more so the way we're doing it. Its a shame we don't have some way of setting a 'max' limit to the number of characters we have for 2 reasons:

1- For safety: If the char-ptr points to non-null-terminated memory, it'll stop us from just arbitrarily printing into space by limiting at least the NUMBER of characters we print into nonsense.
2- For readability: printing a 'long' string likely makes this output look like nonsense and breaks everything up. Limiting us to only a few characters is likely a good idea.
3- <Bonus #3 from @aaron.ballman >: It might discourage SOME level of attempts at using this for reflection, or at least make it a little harder.

What I would love would be for something like a 10 char max:

struct S {
   char *C;
 };
 S s { "The Rest of this string is cut off"};
 print as:
 struct U20A a = {
   .c = 0x1234 "The Rest o"
 };

Sadly, I don't see something like that in printf specifiers? Unless someone smarter than me can come up with some trickery. PERHAPS have the max-limit compile-time configurable, but I don't feel strongly.

FWIW, I'm in favor of the patch as it sits.

As a followup: So I was thinking about the "%s" specifier for string types. Assuming char-ptr types are all strings is a LITTLE dangerous, but more so the way we're doing it. Its a shame we don't have some way of setting a 'max' limit to the number of characters we have for 2 reasons:

1- For safety: If the char-ptr points to non-null-terminated memory, it'll stop us from just arbitrarily printing into space by limiting at least the NUMBER of characters we print into nonsense.
2- For readability: printing a 'long' string likely makes this output look like nonsense and breaks everything up. Limiting us to only a few characters is likely a good idea.
3- <Bonus #3 from @aaron.ballman >: It might discourage SOME level of attempts at using this for reflection, or at least make it a little harder.

What I would love would be for something like a 10 char max:

struct S {
   char *C;
 };
 S s { "The Rest of this string is cut off"};
 print as:
 struct U20A a = {
   .c = 0x1234 "The Rest o"
 };

Sadly, I don't see something like that in printf specifiers? Unless someone smarter than me can come up with some trickery. PERHAPS have the max-limit compile-time configurable, but I don't feel strongly.

The C Standard has this in the specification of the %s format specifier:

If no l length modifier is present, the argument shall be a pointer to storage of character
type. Characters from the storage are written up to (but not including) the terminating
null character. If the precision is specified, no more than that many bytes are written. If
the precision is not specified or is greater than the size of the storage, the storage shall
contain a null character.

So you can use the precision modifier on %s to limit the length to a particular number of bytes. The only downside I can think of to picking a limit is, what happens when the user stores valid UTF-8 data in their string and prints it via %.10s (will we then potentially be splitting a codepoint in half and that does something bad?

FWIW, I'm in favor of the patch as it sits.

As a followup: So I was thinking about the "%s" specifier for string types. Assuming char-ptr types are all strings is a LITTLE dangerous, but more so the way we're doing it. Its a shame we don't have some way of setting a 'max' limit to the number of characters we have for 2 reasons:

1- For safety: If the char-ptr points to non-null-terminated memory, it'll stop us from just arbitrarily printing into space by limiting at least the NUMBER of characters we print into nonsense.
2- For readability: printing a 'long' string likely makes this output look like nonsense and breaks everything up. Limiting us to only a few characters is likely a good idea.
3- <Bonus #3 from @aaron.ballman >: It might discourage SOME level of attempts at using this for reflection, or at least make it a little harder.

What I would love would be for something like a 10 char max:

struct S {
   char *C;
 };
 S s { "The Rest of this string is cut off"};
 print as:
 struct U20A a = {
   .c = 0x1234 "The Rest o"
 };

Sadly, I don't see something like that in printf specifiers? Unless someone smarter than me can come up with some trickery. PERHAPS have the max-limit compile-time configurable, but I don't feel strongly.

The C Standard has this in the specification of the %s format specifier:

If no l length modifier is present, the argument shall be a pointer to storage of character
type. Characters from the storage are written up to (but not including) the terminating
null character. If the precision is specified, no more than that many bytes are written. If
the precision is not specified or is greater than the size of the storage, the storage shall
contain a null character.

So you can use the precision modifier on %s to limit the length to a particular number of bytes. The only downside I can think of to picking a limit is, what happens when the user stores valid UTF-8 data in their string and prints it via %.10s (will we then potentially be splitting a codepoint in half and that does something bad?

Ah! TIL! That I think would be an excellent improvement to this builtin. I suspect we could just choose a fixed value to start (maybe even in this patch @rsmith ?), then add a compile-time option in the future if folks care to extend it. I would posit that 10-15 would be a good start? Any shorter would be prohibitive I think? I could be talked into a little longer...

I found LLDB have this feature. LLDB uses "..." to indicate that only part of the string is now printed, and it also have a command to modify length limit. This approach is similar to the idea of @erichkeane, can we do the same?what do you all think?

typedef signed char int8_t;
typedef unsigned char uint8_t;

int printf(const char *, ...);

int main() {
  struct U20A {
    const char *str;
  };

  struct U20A a = {
      .str = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
  };
    __builtin_dump_struct(&a, &printf);
}
(lldb) p a.str
(const char *) $1 = 0x000000010000329a "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...

I was thinking of that, but I don't have a good way using just format strings to get us to print the '...' after. I think if someone wants that part of the feature, they'll have to wrap printf/etc in a wrapper that does that.

This is unfortunately true, unless we generate some code that calculates the length of the string at runtime. I find it difficult to draw the line between defining this built-in functionality and developers developing printf-like functionality, if built-in does too much, flexibility will decrease, if built-in does less, (features like length limits) then develop personnel may need to do some work. o(≧口≦)o

rsmith updated this revision to Diff 427463.May 5 2022, 2:53 PM
rsmith marked an inline comment as done.
  • Put ""s around %s formats and add a length limit.
This revision was landed with ongoing or failed builds.May 5 2022, 2:56 PM
This revision was automatically updated to reflect the committed changes.
erichkeane added inline comments.May 6 2022, 6:00 AM
clang/docs/LanguageExtensions.rst
2436

For post-commit: perhaps the string quoting and limit should be documented?

clang/test/CodeGen/builtin-dump-struct.c