This is an archive of the discontinued LLVM Phabricator instance.

Add a warning, flags and pragmas to limit the number of pre-processor tokens in a translation unit
ClosedPublic

Authored by hans on Jan 14 2020, 7:16 AM.

Details

Summary

See https://docs.google.com/document/d/1xMkTZMKx9llnMPgso0jrx3ankI4cv60xeZ0y4ksf4wc/preview for background discussion.

This adds a warning, flags and pragmas to limit the number of pre-processor tokens either at a certain point in a translation unit, or overall.

The idea is that this would allow projects to limit the size of certain widely included headers, or of translation units overall, as a way to insert backstops against header bloat and prevent compile-time regressions.
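
For illustration, a minimal sketch of what such a backstop might look like in a widely included header. The pragma spelling is the one the review converges on below (max_tokens_here), and the header name and limit are made-up examples rather than anything from the patch:

```
// big_header.h (hypothetical): cap how many preprocessor tokens have been
// lexed by the time this header has been fully processed.
#pragma once

#include <map>
#include <string>
#include <vector>

// Warn if the translation unit has produced more than 50,000 preprocessor
// tokens by this point (the limit is chosen purely for illustration).
#pragma clang max_tokens_here 50000
```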

What do you think?

Diff Detail

Event Timeline

hans created this revision.Jan 14 2020, 7:16 AM
rnk added a comment.Jan 21 2020, 4:08 PM

I waited to see if there was any other feedback, but I'm in favor of this.

Should we try to come up with better pragma names? clang max_tokens doesn't seem to call to mind what it does: warn if there have been more than this many tokens so far in the translation unit. max_file_tokens has to do with the number of tokens in the translation unit overall, but it uses the terminology "file" instead of "translation unit". The user could interpret that as being in the current source file, ignoring includes.

Some ideas for the immediate version:

  • clang max_tokens_so_far
  • clang max_tokens_lexed
  • clang max_tokens_here

Some ideas for end-of-tu:

  • clang max_translation_unit_tokens
  • clang max_tu_tokens
  • clang global_max_tokens
kimgr added a subscriber: kimgr.Jan 21 2020, 11:01 PM

I just want to say that finding the correlation between token count and compile time is a bit of a breakthrough! Could you expose a flag for printing token count so users can run their own analysis? Or does that already exist in baseline clang? It's easier to set a maximum for a codebase if the distribution is known.

Thanks!

hans added a comment.Jan 22 2020, 9:01 AM
In D72703#1832678, @rnk wrote:

I waited to see if there was any other feedback, but I'm in favor of this.

Should we try to come up with better pragma names? clang max_tokens doesn't seem to call to mind what it does: warn if there have been more than this many tokens so far in the translation unit. max_file_tokens has to do with the number of tokens in the translation unit overall, but it uses the terminology "file" instead of "translation unit". The user could interpret that as being in the current source file, ignoring includes.

Thanks for thinking about the names. I agree they are not ideal.

Some ideas for the immediate version:

  • clang max_tokens_so_far
  • clang max_tokens_lexed
  • clang max_tokens_here

I went with max_tokens because it's shorter, and I figured maybe the "here" could be implicit as most things happen where the pragma is. But since we'll also have the per-tu variant, maybe it makes sense to have a longer name. Of your alternatives I like max_tokens_here best.

Some ideas for end-of-tu:

  • clang max_translation_unit_tokens
  • clang max_tu_tokens
  • clang global_max_tokens

I went with "file" because tu is such a technical term and I'm not sure we generally use it in clang's interface. What do you think about max_tokens_total?

hans added a comment.Jan 22 2020, 9:06 AM

I just want to say that finding the correlation between token count and compile time is a bit of a breakthrough!

I assume the same correlation could also be found with lines of code, but I think token count is a better dimension to measure since it's less likely to be gamed, and it's also kind of the basic work unit that the compiler deals with.

Could you expose a flag for printing token count so users can run their own analysis? Or does that already exist in baseline clang? It's easier to set a maximum for a codebase if the distribution is known.

I used this patch with -fmax-tokens 1 and scraped the output for my measurements. I'd like to avoid adding a separate flag if we can.
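
Roughly, that measurement workflow looks like this (a sketch; the file name is hypothetical and the flag spelling is the one quoted above, which may differ in the final patch):

```
// probe.cc (hypothetical): any ordinary translation unit works; nothing
// special is needed in the source itself.
#include <vector>

int main() { return 0; }

// Compiling with an impossibly low limit makes the warning fire for every
// TU and report the actual token count, which can then be scraped from the
// build log to see the distribution across a codebase:
//
//   clang++ -c probe.cc -fmax-tokens 1
```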

rnk added a comment.Jan 22 2020, 10:59 AM

I like the max_tokens_here / max_tokens_total variants.

hans updated this revision to Diff 239944.Jan 23 2020, 10:52 AM

Doing max_tokens_here / max_tokens_total.

Please take another look.
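
For reference, a sketch of how the two renamed pragmas might be used side by side; the file names and limits are made up, and the argument syntax is assumed to match the earlier example:

```
// common.h (hypothetical): backstop for this header and everything it
// transitively includes, checked at the point where the pragma appears.
#pragma clang max_tokens_here 60000

// main.cc (hypothetical): backstop for the translation unit as a whole,
// checked once the entire TU has been lexed.
#pragma clang max_tokens_total 1500000
```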

rnk accepted this revision.Jan 23 2020, 2:47 PM

lgtm :)

This revision is now accepted and ready to land.Jan 23 2020, 2:47 PM

kimgr added a comment.

I assume the same correlation could also be found with lines of code, but I think token count is a better dimension to measure since it's less likely to be gamed, and it's also kind of the basic work unit that the compiler deals with.

Yeah, and code with lots of documentation comments isn't penalized.

Could you expose a flag for printing token count so users can run their own analysis? Or does that already exist in baseline clang? It's easier to set a maximum for a codebase if the distribution is known.

I used this patch with -fmax-tokens 1 and scraped the output for my measurements. I'd like to avoid adding a separate flag if we can.

Ah, it didn't occur to me that -fmax-tokens itself could be used like this. Thanks!

FWIW, I'm not ecstatic about max_tokens_here; I thought max_tokens_lexed had a nicer ring to it. /peanut.

hans added a comment.Jan 27 2020, 6:59 AM

FWIW, I'm not ecstatic about max_tokens_here; I thought max_tokens_lexed had a nicer ring to it. /peanut.

I mainly like the clarity of the difference between _here and _total. With _lexed, I don't think the difference would be as clear.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project.Jan 27 2020, 7:07 AM