Page MenuHomePhabricator

Introduce llvm FormatVariadic

Authored by zturner on Oct 13 2016, 5:19 PM.



As discussed on the mailing list, here is my initial pass at the variadic formatting library.

Formatters are provided for integral, character, string, and pointer types. More complex formatters such as those required for chrono classes are not yet provided, but they are lower priority.

Note that this patch is blocked on D25497, so while I welcome reviews to this any time, it can't go in until that one is approved.

Adding a bunch of people that expressed interest in this from the list.

Diff Detail


Event Timeline

zturner updated this revision to Diff 74602.Oct 13 2016, 5:19 PM
zturner retitled this revision from to Introduce llvm FormatVariadic.
zturner updated this object.
zturner added a subscriber: llvm-commits.

I should have said formatters are provided for *numeric* types. There are float / double formatters as well.

zturner updated this revision to Diff 74698.Oct 14 2016, 9:09 AM

Updated tests to use C++11 user-defined literals. Also added a format provider for type bool and associated tests.

zturner updated this revision to Diff 74726.Oct 14 2016, 11:58 AM

Changes from previous revisions:

  1. Added a format_provider for llvm::iterator_range<T>. Fails to compile if there is no known formatter for T. Style options for the iterator_range provider allow you to specify the separator as well as the Style options for the underlying type, which get passed through.
  1. Added a test demonstrating the use of a format adapter (i.e. the second way to customize formatting behavior) by making it print the negative of whatever number you give. The example here is trivial, but the idea is just to illustrate that format_provider and adapters both work. In practice you could make a much more complicated adapter.

I think this is done for now until I get some feedback.

inglorion added inline comments.
144 ↗(On Diff #74726)

We should probably also document here how one gets literal braces.

42 ↗(On Diff #74726)

I would be more comfortable if we treated single, unbalanced braces as errors. If we specify a brace as starting a replacement pattern and a double brace as the specifier of a literal brace, the meaning of something like "{0" is undefined and it's not clear what behavior was intended. Is there a way we report this to the programmer, so that they can fix it to be "{{0" or "{0}" or whatever they really meant?

inglorion added inline comments.Oct 14 2016, 2:26 PM
152 ↗(On Diff #74726)

Do you have a use case in mind for having a replacement sequence without an index, with or without the other fields?

zturner added inline comments.Oct 14 2016, 2:44 PM
152 ↗(On Diff #74726)

Not specifically, it was a choice between replacing with an empty string vs leaving it in the string as a literal.

This does raise an interesting point though, which is that I'm not currently consistent. Sometimes I replace with empty string, sometimes I leave the invalid specifier in un-replaced. Perhaps we should be consistent. Always delete from the output, or always leave it in exactly.

zturner updated this revision to Diff 74744.Oct 14 2016, 2:54 PM

Improved some comments and documentation. Also assert when there is an unterminated open brace in the format string, suggested by @inglorion

MatzeB added a subscriber: MatzeB.Oct 14 2016, 3:23 PM
inglorion added inline comments.Oct 14 2016, 3:24 PM
152 ↗(On Diff #74726)

I would say this is another case where the intended behavior is not clear, so we should treat it as an error. Compared to picking one of those behaviors now and specifying that as expected, treating it as an error has the advantage that if we later come up with some clearly useful behavior for the case, we can change the specified behavior from error to whatever it is we want it to do, and it will not break any previously correct code.

inglorion added inline comments.Oct 14 2016, 3:39 PM
196 ↗(On Diff #74744)

While we're at it, we should probably do the same thing for closing braces, too. I think having symmetry between {{ and }} is a little nicer than having {{ and }, and it also matches what C# and Python do.

zturner updated this revision to Diff 75034.Oct 18 2016, 10:47 AM

Changes from previous revision.

  1. Formalized the grammar of a replacement sequence.
  2. Simplified the parsing code to adhere to the grammar and better assert on error sequences.
  3. Added a few non-trivial adapters to demonstrate their use (including adapter composition).

I have a couple of questions and comments inline. Thanks for making the changes I suggested earlier!

20 ↗(On Diff #75034)

Can you make this class instead of struct? The classes that extend it are all declared as class, and you are explicitly specifying protected anyway, so might as well make it more consistent.

26 ↗(On Diff #75034)

Is there any way to get the name of the type in the message?

32 ↗(On Diff #75034)

That seems like a useful change to make. Want to make it before committing this? The amount of code that is going to have to be changed for it is only going to get larger.

45 ↗(On Diff #75034)

In general for code that dispatches on an enum, if you convert it to switch instead of if, the compiler can warn if the code doesn't handle all the cases. Clang doesn't seem to do this for if ... else if, so I prefer code like this to be written using switch.

21 ↗(On Diff #75034)

Please remove.

32 ↗(On Diff #75034)

Instead of explicitly enumerating integral types, can we use something like is_integral<T>::value && !use_char_formatter<T>::value here?

37 ↗(On Diff #75034)

Should this support unsigned char, too? What about wchar_t?

87 ↗(On Diff #75034)

For styles other than hex and exponential, is there a difference between the uppercase and lowercase specifiers? If not, consider only supporting one so that we get consistent usage in the code base.

250 ↗(On Diff #75034)

In cases like these where \brief is used with a single-sentence paragraph, I would leave out the \brief - looks a little cleaner. It's also consistent with

72 ↗(On Diff #75034)

I don't follow - why do we have to check IsNegative when T is an unsigned type?

zturner updated this revision to Diff 76434.Oct 31 2016, 9:42 AM

I was blocked on this for a while due to a miscompile in the AArch64 backend for some preliminary work I wanted to get in. I got unblocked on that, so I'm updating this diff now to take everything into account. Will go back now and respond to missed comments.

In the meantime, if you haven't reviewed this and you are interested in this formatting library, I would suggest at least taking a look at FormatVariadic.h and FormatProviders.h. There are large comment blocks in those files that formalize the grammar of the the replacement sequences etc. So even if you don't review the code, commenting on the grammar description would be equally helpful.

zturner added inline comments.Oct 31 2016, 9:52 AM
26 ↗(On Diff #75034)

That's actually the whole reason for this. You will get the name of the type because T is the type, and most compilers will print out what T is.

32 ↗(On Diff #75034)

I thought about it, but honestly I'd rather do it later. It's purely an optimization, and once this is in I can make that change via post-commit review.

32 ↗(On Diff #75034)

is_integral also includes bool. So you'd also have to add a !use_bool_formatter. At that point I think this more clearly expresses the intention.

37 ↗(On Diff #75034)

unsigned char is basically uint8_t, which is explicitly listed in the integral formatter. An unsigned char usually doesn't denote ascii text, but rather just a raw byte. So it makes sense to treat it the same as other numbers, while signed char is treated as ASCII. Note that the char formatter's implementation delegates to the integer formatter (hence supporting all the same options as the integer one) if the style is anything other than empty. So they're already very similar.

87 ↗(On Diff #75034)

No, but at least for hex, the difference between 0xABC and 0xabc is likely to start a heated debate :) If lots of people feel strongly I can standardize on only one, but in my experience there are times when lowercase looks awkward and times when uppercase looks awkward, so I wouldn't want to force one or the other.

72 ↗(On Diff #75034)

Because we can get here via write_signed. Basically, this means "was the original number less than 0". In write_signed, we check if it's less than 0 (hence meaning a negative has to be printed), then massage it to get the absolute value into an unsigned integer, then pass it into this function with IsNegative=true. This allows us to have one function that prints any number, and if IsNegative is true it just puts a - in front. Otherwise you end up with 3 or 4 functions which all look exactly the same except for 1-2 lines of code, which is irksome.

inglorion added inline comments.Oct 31 2016, 10:29 AM
87 ↗(On Diff #75034)

To clarify: I meant supporting only one casing in cases where there is no functional difference between the two. I'm all for supporting both uppercase and lowercase hex, and using uppercase and lowercase format specifiers seems like a great way to do it. I'm less sure about supporting both D and d, N and n. I realize that I attached my comment to the hex handling, which is the most confusing place I could have chosen, given that I'm actually happy with that one. My bad.

72 ↗(On Diff #75034)

Ah, I see. So write_unsigned_impl can be used to write both positive and negative integers, it's just that they're passed in as a sign bit and an unsigned magnitude. I guess I was confused by the name of the function.

inglorion requested changes to this revision.Oct 31 2016, 10:41 AM
inglorion added a reviewer: inglorion.

Requesting changes for a few minor outstanding issues:

  • include/llvm/Support/FormatAdapters.h line 20: Please make that "class" instead of "struct" for consistency with the other declarations.
  • include/llvm/Support/FormatProviders.h: Please remove the commented-out include.
  • There are a couple of cases where you have \brief followed by a single sentence paragraph. Since \brief is redundant in that case, please remove it.
  • include/llvm/Support/FormatCommon.h: I had a comment here about using switch instead of if so that the compiler can check that all cases are handled. I'm ok with the code as-is in this case, but pointing it out in case you want to change it while you're making the other changes.

Other than that, looks good to me and I will accept it once the above are addressed.

This revision now requires changes to proceed.Oct 31 2016, 10:41 AM
zturner updated this revision to Diff 76491.Oct 31 2016, 3:17 PM
zturner edited edge metadata.

Fixed remaining suggestions by inglorion

inglorion accepted this revision.Oct 31 2016, 4:10 PM
inglorion edited edge metadata.

Thanks! One last inline nit, then LGTM.

57 ↗(On Diff #76491)

You can now also replace this with case AlignStyle::Left: as suggested in

This revision is now accepted and ready to land.Oct 31 2016, 4:10 PM
silvas edited edge metadata.Oct 31 2016, 11:39 PM

It would be awesome if as a follow-up, you added a section to docs/ProgrammersManual.rst comparing string formatting options in LLVM (AFAIK, we now have this new shiny stuff, llvm::format, and raw_svector_ostream/raw_string_ostream). That should hopefully prevent new uses of the old stuff from creeping in (providing some good examples in there for common uses of the new stuff would also be good for helping this).

Also, it would be great if once this has landed, you manually verified that the doc comments end up looking "right" in the doxygen. Many LLVM developers that don't use IDE's will google for "llvm format_string" in hope of finding a doxygen page.

And thanks for all the work in putting this together! It's awesome!

labath added a subscriber: labath.Nov 1 2016, 4:26 AM

I am not sure about the use of rvalue references. Could you check that they do the thing you wanted to do?

22 ↗(On Diff #76491)

This will still call the copy constructor. Did you want to use Item(std::move(Item)) instead?

36 ↗(On Diff #76491)

std::forward is typically used in contexts where T is deduced (here it isn't as it is a class parameter). I am not 100% sure but I don't think it's usage here is correct (and std::move should be used instead). std::forward will only do something if you're expecting someone to explicitly instantiate the class with T = foo & and such.

zturner updated this revision to Diff 76798.Nov 2 2016, 3:43 PM
zturner edited edge metadata.
  1. Remove UDL syntax due to objections raised by chandlerc@.
  2. Added conversion operators to std::string and SmallString<N> to reduce the verbosity of formatting and saving the result as a string.
  3. Changed the name from format_string to formatv to reduce the verbosity again as a result of no longer having UDL syntax available.
aprantl added a subscriber: aprantl.Nov 2 2016, 7:55 PM
chandlerc edited edge metadata.Nov 7 2016, 9:38 AM

Nitpicks I noticed when reading it. Some of it is due to the changes to remove UDLs.

34–36 ↗(On Diff #76798)

Please use early return to reduce indentation.

41–43 ↗(On Diff #76798)

Same here.

15–19 ↗(On Diff #76798)

Comments still reference UDL syntax.

159 ↗(On Diff #76798)

This is now spelled "formatv". The entire comment block needs updating.

Also, I'd suggest not repeating the name of the thing in the brief comment. In Doxygen it is displayed right next to the name anyways so it becomes rapidly redundant and as it is here can easily fall out of sync. Instead, I'd write a one sentence summary for the brief section.

203 ↗(On Diff #76798)

This header is different from all the others. Maybe use markdown syntax if you need this kind of structure?

239–240 ↗(On Diff #76798)

I would prefer to skip the empty lines to make it clear what this block applies to.

labath added a comment.Nov 7 2016, 9:56 AM

Will it be possible for the custom part of the format specifier to contain :? So we can specify custom formatting for time points with something like formatv("{0:YYYY-MM-DD hh:mm:ss}", system_clock::now());. (It does not seem that it will be a problem at a first glance, but I thought I'd check).

Will it be possible for the custom part of the format specifier to contain :? So we can specify custom formatting for time points with something like formatv("{0:YYYY-MM-DD hh:mm:ss}", system_clock::now());. (It does not seem that it will be a problem at a first glance, but I thought I'd check).

It was a problem in my original implementation, but it should not be a problem anymore. The algorithm goes like this:

  1. Find the first comma. If it exists, everything up to the next colon (or end of string) is the alignment / justification field
  2. From the endpoint of step 1, find the first colon. If it exists, everything until the end of the string is the style specifier.

The only thing you simply cannot have within a style specifier is a curly brace, because that would terminate the replacement sequence. (It might be possible to support by putting an escaped closing curly brace, but I didn't think it was necessary to complicated the parsing logic with this, especially since we aim to support static verification in the future, and for those purposes, the simpler the better).

zturner updated this revision to Diff 77060.Nov 7 2016, 10:30 AM
zturner edited edge metadata.

Address nits from chandlerc.

zturner updated this revision to Diff 77218.Nov 8 2016, 11:27 AM

Update programmer's manual as per suggestion from clattner.

Last chance for further comments before I put this in later this afternoon.

This revision was automatically updated to reflect the committed changes.
emaste added a subscriber: emaste.Nov 11 2016, 4:37 PM

FreeBSD build is failing:

../include/llvm/Support/FormatProviders.h:360:10: error: no member named 'vector' in namespace 'std'
    std::vector<const char *> Delims = {"[]", "<>", "()"};
../include/llvm/Support/FormatProviders.h:360:17: error: expected expression
    std::vector<const char *> Delims = {"[]", "<>", "()"};
../include/llvm/Support/FormatProviders.h:361:26: error: use of undeclared identifier 'Delims'
    for (const char *D : Delims) {
3 errors generated.

Hopefully fixed in r286691

Fixed, thanks.