Page MenuHomePhabricator

[extra] pptrace - preprocessor tracing and testing tool

Authored by jtsoftware on Oct 24 2013, 1:46 PM.



pp-trace is a tool for displaying preprocessor activity by means of the PPCallbacks interface. Its primary reason for existence is for testing the PPCallbacks mechanism, but it also might be useful as a tool for understanding preprocessor activity (as an alternative to clang's -P option for looking at preprocessor output).

This is a first cut, for proof of concept, with only two tests for demonstration purposes.

It supports two output formats, YAML and a generic format. Originally I just put in the YAML, but I felt it was kind of verbose, so I added the generic format as a default.

Please see the file comments for details.

Documentation and more tests will follow in a later checkin.

Diff Detail

Event Timeline

I don't see the purpose of the "generic" output format. It is only 1 character away from being valid YAML. The following is valid YAML (Assuming the ArgumentK values are unique, which I think is true in this case):

- CallbackName:
    Argument1: Value1
    Argument2: Value2

The overhead of maintaining two output formats doesn't seem worth it. For simplicity of code that manipulates this output, it may be best to use something like:

- Kind: CallbackName
  Argument1: Value1
  Argument2: Value2

(i.e., have the callback name just be another field of the record; this is a very common pattern for handling serialized variant data in dynamic languages (and YAML's semantic model is basically wed to dynamic languages)).

Since you don't need to read this information back into the same data structure that produced it, it seems like it might be simpler to forego YAMLIO.

21 In particular, "PPCallbacksTracker.h" should be first.


Any particular reason why the capitalization for this is different from the rest?


This output doesn't seem very useful.


Ew. No. Fixed size buffers are evil.

It seems like most of the reason you need this is just to print a string "Foo(Bar)". Better to just have a subroutine taking the two strings "Foo" and "Bar" and using raw_string_ostream or raw_svector_ostream to generate that output.


Don't copy the documentation from the base class.


Use LLVM_OVERRIDE (i.e. C++11 override) instead of prefixing virtual to all these overriden virtual functions. That way, C++11 builds (which many people are doing these days) will benefit from the error checking.

jtsoftware updated this revision to Unknown Object (????).Oct 25 2013, 7:46 AM

I've revised to use just a YAML output format without YAML I/O, and address the other comments, except the moduleImport name, which I'll fix in later clang+extra checkin.

On second thought, aren't function members supposed start with a lower case? Are the other functions wrong, or is there a convention for virtual/callback functions?


It's like this in the base class. I'll fix it in a separate checkin.

silvas added inline comments.Oct 25 2013, 12:53 PM

I think that Loc.printToString may sometimes not do quite what you want. See You might want to replicate that logic in a customized way here to make sure that the results are consistently as intended.


Use raw_string_ostream or raw_svector_ostream here and anywhere else you have used sprintf.


You can simplify this to just .push_back(Argument(Name, Value))


This format is annoying to parse downstream. Most of the uses of this seem to be just to tag the type of data. For example, SourceRange's should be just a YAML list of two entries: [ loc1, loc2 ], and enum constants could be just their name (or maybe better EnumName::EN_EnumerantName).

The SourceLocation format that you are using to print them seems fine, since that can be deconstructed with a single call to .split(':') (in Python) or similar.


This comment is out of date. Please check the comments on the other methods as well.


I recommend formatting this as a list of YAML flow records (i.e., basically JSON) for easy downstream consumption. So it would be something like:

[{name: foo, loc: sourcelocationstring}, ...]

This seems like it should produce a YAML list in the end, instead of a YAML string formatted like a function call (which would have to be parsed by a consumer, instead of being handled in the YAML parser itself). The YAML flow style can be used, e.g. [ foo, bar, baz ]. For example, this is easily consumed by a client using a YAML parser without an additional parsing step:

- Callback: Name
  Args: [ foo, bar, baz ]
  Argument2: Value2

The fact that it is a list of macro arguments should be clear from the context.


Tiny bit of LLVM style guidance:

  • unless there's a good reason, a for loop integer induction variable should be called i and the upper bound should be called e (which is inconsistent with the rest of the coding style, which is weird, but that's the style).
  • In C++, when you are iterating through a sequence sequentially, the convention is to use != to compare (this generalizes to e.g. linked lists, which can be traversed sequentially, but whose iterators can't have a meaningful <) and ++var (which avoids potentially making a copy of the iterator); obviously it doesn't matter in this case, but for consistency and clarity that is the convention. These conventions originate in the STL. If you aren't familiar with the various conceptual iterator types, you can read a summary here:

So I would recommend rewriting this loop header as:

for (int i = 0, e =  Value->getNumArguments(); i != e; ++i) {
  • When outputting constructs with "separator" semantics (e.g. a comma-separated list) rather than "terminator semantics" (e.g. a semicolon follows each statement, including the last), the pattern I've seen most commonly in LLVM is

    for (int i = ..., e = ...; i != e; ++i) { if (i) OS << ", "; [...] }

and I would recommend following this since it makes it very clear "up front" (at the beginning of the loop) that this is outputting a comma-separated list.

So overall I would recommend writing this loop as:

for (int i = 0, e =  Value->getNumArguments(); i != e; ++i) {
  if (i)
    SS << ", ";
  SS << PP.getSpelling(*Value->getUnexpArgument(i));

This is not correct use of this macro, which is semantically equivalent to the C++11 override specifier. See


I assume that a -- can be used to separate the [compiler options] from the rest. Is that correct? You probably should mention the behavior in that case (especially if what I described is *not* the case, to prevent confusion).


This is not correct use of "i.e." (which basically means "in other words"). You probably meant "e.g." which basically means "for example".


I feel like I'm probably nitpicking here, but I've never seen a reference in the LLVM codebase initialized with parentheses. Please use e.g. Argument &Arg = *AI; here and elsewhere.

(hopefully someday clang-tidy will catch and fix these things).


Use an early exit to simplify this:

if (!Error.empty()) {
  // print message
  return 1;

Why do you need -Xclang for -std=c++11?

Also, why does -Xclang -triple=x86_64 need to be there if you are already passing -target x86_64?

kimgr added a comment.Oct 27 2013, 5:39 AM

I applied the patch and did a clean build on Clang r193498 on Windows, MSVC 10.

For some reason, I'm getting an assertion failure whenever I #include <stdio.h>:
Assertion failed: Result < Start+NumUnexpArgTokens && "Invalid arg #", file llvm-trunk\llvm\tools\clang\lib\Lex\MacroArgs.cpp, line 124

I've boiled it down to the following repro:

// repro.cpp
#define DECORATED(id) id

enum Enum
  DECORATED( One ) = 1

Original problem surfaced in C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include\CodeAnalysis\sourceannotations.h:56.

clang-check accepts it without complaining, so it seems like there's something amiss in the setup of pp-trace.

jtsoftware updated this revision to Unknown Object (????).Oct 28 2013, 9:52 AM


I revised per your comments, except for the one about -- in the command line, as I don't understand. I think you can use either "-' or "--" for options. I don't know about using "--" to separate the compiler options.


What I usually do in the case of strange build errors on Windows is update cmake, delete the CMakeCache.txt file, rerun the initial camke command line ("cmake -G "Visual Studio 10" -DCLANG_BUILD_EXAMPLES=ON ." in my case), run clean in Visual Studio, then a build. I just did this, but didn't see any error, so I'm not sure what to do.



Thanks for attempting to repro. I've done the same here, and I'm still seeing the problem. I'm using the Ninja CMake generator with cl.exe from MSVC 10.

I can give proper Visual Studio a shot later too and see if the problem persists.

silvas accepted this revision.Oct 28 2013, 12:10 PM

Pending on the issue that Kim is running into, I think this looks good enough to commit.


It's probably simpler to just use a for (int i = 0, e = Ids.size(); ... style loop here.


Same here.


Is this missing a colon after "Loc"? Try running it through a YAML parser (utils/yaml-bench with the -canonical flag should be sufficient to ensure syntactic validity (see test/YAMLParser/ for example usage)).


MD here is a bit cryptic. I see that you are doing this to be maximally consistent with the argument name, but I think that MacroDirective would be better from a readability standpoint.

jtsoftware updated this revision to Unknown Object (????).Oct 28 2013, 5:52 PM


Thanks for the excellent review, and the reference to yaml-bench. It caught a few things so far, and I'll have to use it when I write the other tests too. It didn't like the backslashes in Windows file paths, or the colons in SourceLocations, so I put in some character replacing and quoting.

I revised the for loops as pointed out.

Also, I wanted to ask about the special cases, such as where I use values like (null) to represent when a pointer is null, or (invalid) when a SourceLocation has the invalid flag set. In the former case, passing a null pointer is expected in some situations. The YAML parser doesn't seem to mind (null), but is this reasonable semantically?

Also, again semantically, you probably already noticed I'm not trying to flatten the full data structures for possible later resurrection, just providing what I think is sufficient high-level information to know what the preprocessor is up to.


Thanks also for your help. I'll install ninja cmake and give it a shot.


Still seeing the same assertion when I build with VS10; my steps to repro:

llvm-trunk$ mkdir vcbuild && cd vcbuild
llvm-trunk/vcbuild$ cmake -G "Visual Studio 10" ..\llvm
llvm-trunk/vcbuild$ start LLVM.sln
(right-click and build pp-trace only)
llvm-trunk/vcbuild$ bin\Debug\pp-trace repro.cpp

I can't explain it, but it seems to be consistent :-(


I totally misunderstood. I thought you were talking about a problem building pp-trace, but you're talking about an assertion while running pp-trace, which I'm getting too with your repro. I'm looking into it. Thanks for the help.


jtsoftware updated this revision to Unknown Object (????).Oct 29 2013, 7:52 AM

This revision includes a fix for the assertion Kim saw.

Since we seem to be close on the main idea, after this I'll start adding more test to get better coverage.


kimgr added a comment.Oct 29 2013, 1:00 PM

Ah, I completely missed the fact that your code was calling the asserting method. Works fine now, thanks!

Bunch of aesthetic comments added, I'll try this on some bigger preprocessor challenges...


You don't need *- C++ -* in .cpp files.


Indentation looks funny here, but maybe that's a good thing considering that "0" is apparently not in the enum. :-)


This duplicates the information from the header. I vote to remove it.


The comment seems redundant to me.


Should these be doxygen comments?






I saw Sean suggested lower-case index names, but the coding guidelines seem to advocate upper-case I and E.


I'm not sure what to think about naming the argument in the output something else than the actual arg. It stands out, but it would be nice if the MD argument was in fact named MacroDirective (though it clashes with the type...)












I'll never get used to the fact that SmallSet::count returns a bool :-)


This should be a const std::string &, right?


const std::string &


Should be PPTRACE_CALLBACKS_H or maybe match the filename in full, i.e. PPTRACE_PPCALLBACKSTRACKER_H?


Style guide says to avoid repeating class/function names in \brief. Not sure what to say about the ctor in \brief, maybe something to the effect of ownership?


I prefer const_iterator and const refs to elements, but I don't know what the general convention for LLVM is.


/*MaxSplit=*/-1, /*KeepEmpty=*/false


Add braces around this block to keep symmetry with else

kimgr added a comment.Oct 29 2013, 1:46 PM

Running pp-trace on this file still asserts, unfortunately:

// repro2.cpp
#define X X_IMPL(a,b) X_IMPL2(c)

#define X_IMPL(p1,p2)
#define X_IMPL2(p1)


This looks nonsensical, but is based on C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include\sal.h:454.

I suspected that Value->getNumArguments() might return zero, which would cause your e to become negative (or very large), but the number in this case is four. You still have a bug if getNumArguments ever returns zero, but maybe it never will (i.e. assert on it)

kimgr added a comment.Oct 29 2013, 3:37 PM

OK, I took some time to step through this. Enumerating the arguments of a MacroArgs seems like a difficult problem (tm).

  • getNumArguments does not return the number of arguments, it returns the number of tokens in the argument list
  • getUnexpArgument takes an index, but skips over eof tokens

Here's a way I've found to enumerate over the argument tokens:

// This can probably be translated into a for loop,
// but it's late here and my head is buzzing...
unsigned ArgTokenCount = Value->getNumArguments();
unsigned I = 0;
while (I < ArgTokenCount) {
    const clang::Token* Current = Value->getUnexpArgument(I);
    unsigned TokenCount = Value->getArgLength(Current) + 1; // include EOF
    ArgTokenCount -= TokenCount;

One problem is arguments rarely consist of a single token, but we can look at things like:

#define X Y(1 + 6, 2)
#define Y(a,b)


You'd need to find a way to rejoin a sequence of tokens into a string. The tokens appear to live in contiguous memory, though, so once you have the first token and the length of every argument, the entire arg is covered by [Current, Current + TokenCount). Not sure how to go from there to something you can render, but it should be doable.

Hope that helps!

jtsoftware updated this revision to Unknown Object (????).Oct 30 2013, 5:50 AM

Thanks, Kim, for the excellent review, and also for solving the walking issue. I've applied the suggested changes, revised the MacroArgs argument walking code segment, and added another test.

More tests will follow, but first I'm going to run the tool over a set of platform headers, to see if I can ring out any more crashes.


jtsoftware updated this revision to Unknown Object (????).Oct 30 2013, 9:17 AM

I made one more change to the macro args formatting, as it could produce output with YAML symbols that will confuse a YAML parser. I have it print tokens that are not identifiers or numbers by the results of their getName() call value, enclosed in '<' and '>'.

I ran pp-trace over a full set of platform headers, including running yaml-bench over the output to check for valid YAML, with no errors.

Though I still need to write tests for all the callback types, and user documentation, I think it might be ready for an initial checkin.


I'm in no position to sign off commits, but this looks good and useful to me. A few small comments/questions, otherwise I'm happy. Thanks!


Nice! :-)


const reference?


Why not a const reference?


Just pass callbacks tracker into addPPCallbacks immediately:

PP.addPPCallbacks(new PPCallbacksTracker(Ignore, CallbackCalls, PP));

No use keeping this as a member


Why is the "-" necessary? I just leave out -output and pp-trace dutifully prints to stdout.

(sorry for the delay). As per my earlier comment now that it seems like the
issue Kim was seeing is resolved, feel free to check this in.

More comments inline:

jtsoftware closed this revision.May 5 2014, 6:05 PM

Committed in r193743 (and later fixes in r194440 r194422 r194081 r194079 r193842 r193841 r193746).


The comment of the Mapping enum mentions value of 0 meaning "uncomputed". Indentation courtesy of clang-format:-)


Sean pointed out that since these are overrides, they shouldn't have the Doxygen comments.