Replace plain json text output with llvm JSON library wrapper using.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
- Build Status
Buildable 30595 Build 30594: arc lint + arc unit
Event Timeline
Thank you for follow up.
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
153 | nit: I prefer to drop this new line. |
Thanks! I'm to push it after fixing one thing in JSON.h library which instantiates template function with warning.
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
153 | Ok, done |
I know this will sound redundant, but have you done performance measurements on this?
I recently had to revert a similar cleanup in D58584.
Wow, thanks to pointing this out! Using json library is obviously heavier than just streaming plain text, but I didn't suppose such perf regressions. I'm to measure it.
I think it is true that json library is slower than just naive json output.
But this code run one time in a compile and should have relatively few json objects.
So visible performance effect will be low. Taking stats sounds good and I'm surprised if this patch change clang performance match.
Yes, that is my hope too, but given history i'd like to be confident :)
This will really depend on the number of the entries in the output.
I'd like to see perf measurements on some large compiler invocation, with *many* of these entries in output json report.
I've measured time performance for the large source file test-suite/SingleSource/Benchmarks/Misc-C++-EH/spirit.cpp.
Five tests: without time tracing, with time tracing for old/new json output, with small granularity for old/new json output. Small time granularity leads to many entries in json output (~32K vs ~2K for default granularity).
As one can see, json library is actually slower but it has no notable effect by default and I believe it's eligible price even for large number of entries.
perf stat -r 5 ./build_rel_base/bin/clang++ -w -S spirit.cpp ... 3362.032795 task-clock (msec)
perf stat -r 5 ./build_rel_base/bin/clang++ -w -S -ftime-trace spirit.cpp ... 3577.592945 task-clock (msec)
perf stat -r 5 ./build_rel_exp/bin/clang++ -w -S -ftime-trace spirit.cpp ... 3582.747398 task-clock (msec)
perf stat -r 5 ./build_rel_base/bin/clang++ -w -S -ftime-trace -mllvm -time-trace-granularity=0 spirit.cpp ... 3634.623793 task-clock (msec)
perf stat -r 5 ./build_rel_exp/bin/clang++ -w -S -ftime-trace -mllvm -time-trace-granularity=0 spirit.cpp ... 3995.315749 task-clock (msec)
Thank you for doing the measurements.
Five tests: without time tracing, with time tracing for old/new json output, with small granularity for old/new json output.
Small time granularity leads to many entries in json output (~32K vs ~2K for default granularity).
It would be interested to see memory consumption changes (heaptrack),
but with this small maximal number of entries, pretty much anything should not be too noticeable.
As one can see, json library is actually slower but it has no notable effect by default and I believe it's eligible price even for large number of entries.
perf stat -r 5 ./build_rel_base/bin/clang++ -w -S spirit.cpp ... 3362.032795 task-clock (msec)perf stat -r 5 ./build_rel_base/bin/clang++ -w -S -ftime-trace spirit.cpp ... 3577.592945 task-clock (msec)
+215 ms or +~6% (as compared to without -ftime-trace)
perf stat -r 5 ./build_rel_exp/bin/clang++ -w -S -ftime-trace spirit.cpp ... 3582.747398 task-clock (msec)
+220 ms or +~7% (as compared to without -ftime-trace, or +5ms/0.1% as compared to without this patch)
perf stat -r 5 ./build_rel_base/bin/clang++ -w -S -ftime-trace -mllvm -time-trace-granularity=0 spirit.cpp ... 3634.623793 task-clock (msec)
+273 ms or +~8% (as compared to without -ftime-trace)
perf stat -r 5 ./build_rel_exp/bin/clang++ -w -S -ftime-trace -mllvm -time-trace-granularity=0 spirit.cpp ... 3995.315749 task-clock (msec)
+733 ms or +19% (as compared to without -ftime-trace, or +360ms/+10% as compared to without this patch)
Okay, so the price increase in default config is reasonably small.
It is noticeably larger with bigger number of events though.
I don't like how non-thin of a wrapper "llvm/Support/JSON.h" is. :/
TLDR: Not horrible, but above what i would consider a reasonable price for abstraction.
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | There is no reserve() function in json::Array, thus you end up doing the memory dance *every* entry. :/ | |
97 | emplace_back() doesn't work? |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | Good point to change in json lib. | |
97 | It works (for explicitly given cons), I've changed it in a separate review: https://reviews.llvm.org/D60663 |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
97 | Oops, that's another push_back(), I'm wrong. |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | If you don't intend to look into that (should be rather trivial), |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | I've measured the perf effect of adding Array.reserve(33000) (changed JSON.h a bit), it gives 3845 msec (vs 3995 msec) for -time-trace-granularity=0. Effect is noticeable, but actual implementation need a smart heuristics to determine the entries number to reserve. For the default option (time-trace-granularity = 500 msec) this doesn't matter. | |
91–94 | What do you mean by bug? Do you mean I should add Array.reserve(N) with N taken heuristically from different runs for different time-trace-granularity? Or would just Array.reserve(4000) be enough for default granularity? |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | Also this result means that time taken by json output could be decreased by 40% (1 - (3845-3634)/(3995-3634)), but this is actual and significant only for time-trace-granularity=0. The effect of array reservation for the default option value is insignificant. |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 |
The lack of json::array::reserve()
Hm? |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | Ooops, you're right! Here is fix for json Array: https://reviews.llvm.org/D60788 |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | Not 40% though. |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | These are different metrics. I'm talking about time portion taken by Array of whole time taken by json lib using. |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | Ah, right, sorry. |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | My metrics results relates to your words that majority or all extra cost is taken by "dancing" with Array. Actually not the majority but ~40% -- still significant number. |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91–94 | The problem is that every entry in std::vector inside of json::array is json::Value, So you have an array of densemaps... Naturally, there is quite a bit of overhead. |
Just wanted to note I'm going to look at drafting a lower-level streaming output API so cases like this can avoid materializing all those expensive objects.
That said I don't claim any special expertise in this area, if someone else wants to jump on it, go ahead.
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91 | Sorry, I was vaguely aware of this patch but hadn't looked at the detail. This will reduce peak mem usage, but also the repeated allocation/deallocation of similar structures may make allocation cheaper (I know it does a lot with tcmalloc). Not sure how much this will help, as always with performance, measuring is a good idea. |
llvm/lib/Support/TimeProfiler.cpp | ||
---|---|---|
91 | Hmm, good point, thanks. I'm to try and measure it. My preliminary thought is that json::Array reserving has already got rid of this time extra cost (not memory peak usage though). |
Sorry, I was vaguely aware of this patch but hadn't looked at the detail.
As there's likely to be lots of events, I'd suggest still writing the {"traceEvents":[ / ]} container by hand and the "," separator between events if you care about performance.
You can still use the JSON lib for composing and serializing each individual event.
This is what we do in clangd.
This will reduce peak mem usage, but also the repeated allocation/deallocation of similar structures may make allocation cheaper (I know it does a lot with tcmalloc).
Not sure how much this will help, as always with performance, measuring is a good idea.