This is an archive of the discontinued LLVM Phabricator instance.

Use native llvm JSON library for time profiler output
Closed, Public

Authored by anton-afanasyev on Apr 12 2019, 5:19 AM.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript · Apr 12 2019, 5:19 AM
takuto.ikuta accepted this revision.Apr 12 2019, 6:32 PM

Thank you for the follow-up.

llvm/lib/Support/TimeProfiler.cpp
153

nit: I prefer to drop this new line.

This revision is now accepted and ready to land.Apr 12 2019, 6:32 PM

Deleted trailing '\n'

anton-afanasyev marked 2 inline comments as done.Apr 12 2019, 11:26 PM

Thank you for the follow-up.

Thanks! I'll push it after fixing one thing in the JSON.h library, which instantiates a template function that produces a warning.

llvm/lib/Support/TimeProfiler.cpp
153

Ok, done

lebedev.ri requested changes to this revision.Apr 12 2019, 11:33 PM
lebedev.ri added a subscriber: lebedev.ri.

I know this will sound redundant, but have you done performance measurements on this?
I recently had to revert a similar cleanup in D58584.

This revision now requires changes to proceed.Apr 12 2019, 11:33 PM
anton-afanasyev marked an inline comment as done.Apr 12 2019, 11:46 PM

I know this will sound redundant, but have you done performance measurements on this?
I recently had to revert a similar cleanup in D58584.

Wow, thanks for pointing this out! Using the json library is obviously heavier than just streaming plain text, but I didn't expect such perf regressions. I'll measure it.

I think it is true that the json library is slower than just naive json output.
But this code runs once per compile and should produce relatively few json objects.
So the visible performance effect will be low. Taking stats sounds good, and I'd be surprised if this patch changed clang performance much.

I think it is true that the json library is slower than just naive json output.
But this code runs once per compile and should produce relatively few json objects.
So the visible performance effect will be low. Taking stats sounds good, and I'd be surprised if this patch changed clang performance much.

Yes, that is my hope too, but given the history I'd like to be confident :)
This will really depend on the number of entries in the output.
I'd like to see perf measurements on some large compiler invocation, with *many* of these entries in the output json report.

I think it is true that the json library is slower than just naive json output.
But this code runs once per compile and should produce relatively few json objects.
So the visible performance effect will be low. Taking stats sounds good, and I'd be surprised if this patch changed clang performance much.

Yes, that is my hope too, but given the history I'd like to be confident :)
This will really depend on the number of entries in the output.
I'd like to see perf measurements on some large compiler invocation, with *many* of these entries in the output json report.

I've measured time performance for the large source file test-suite/SingleSource/Benchmarks/Misc-C++-EH/spirit.cpp.

Five tests: without time tracing, with time tracing for old/new json output, with small granularity for old/new json output. A small time granularity leads to many entries in the json output (~32K vs ~2K for the default granularity).
As one can see, the json library is indeed slower, but it has no notable effect by default, and I believe it's an acceptable price even for a large number of entries.

perf stat -r 5 ./build_rel_base/bin/clang++ -w -S spirit.cpp
...
3362.032795      task-clock (msec)
perf stat -r 5 ./build_rel_base/bin/clang++ -w -S -ftime-trace spirit.cpp
...
3577.592945      task-clock (msec)
perf stat -r 5 ./build_rel_exp/bin/clang++ -w -S -ftime-trace spirit.cpp
...
3582.747398      task-clock (msec)
perf stat -r 5 ./build_rel_base/bin/clang++ -w -S -ftime-trace -mllvm -time-trace-granularity=0 spirit.cpp
...
3634.623793      task-clock (msec)
perf stat -r 5 ./build_rel_exp/bin/clang++ -w -S -ftime-trace -mllvm -time-trace-granularity=0 spirit.cpp
...
3995.315749      task-clock (msec)
lebedev.ri resigned from this revision.Apr 15 2019, 5:34 AM

I think it is true that the json library is slower than just naive json output.
But this code runs once per compile and should produce relatively few json objects.
So the visible performance effect will be low. Taking stats sounds good, and I'd be surprised if this patch changed clang performance much.

Yes, that is my hope too, but given the history I'd like to be confident :)
This will really depend on the number of entries in the output.
I'd like to see perf measurements on some large compiler invocation, with *many* of these entries in the output json report.

I've measured time performance for the large source file test-suite/SingleSource/Benchmarks/Misc-C++-EH/spirit.cpp.

Thank you for doing the measurements.

Five tests: without time tracing, with time tracing for old/new json output, with small granularity for old/new json output.

A small time granularity leads to many entries in the json output (~32K vs ~2K for the default granularity).

It would be interesting to see memory consumption changes (heaptrack),
but with this small a maximum number of entries, pretty much nothing should be too noticeable.

As one can see, the json library is indeed slower, but it has no notable effect by default, and I believe it's an acceptable price even for a large number of entries.

perf stat -r 5 ./build_rel_base/bin/clang++ -w -S spirit.cpp
...
3362.032795      task-clock (msec)
perf stat -r 5 ./build_rel_base/bin/clang++ -w -S -ftime-trace spirit.cpp
...
3577.592945      task-clock (msec)

+215 ms or +~6% (as compared to without -ftime-trace)

perf stat -r 5 ./build_rel_exp/bin/clang++ -w -S -ftime-trace spirit.cpp
...
3582.747398      task-clock (msec)

+220 ms or +~7% (as compared to without -ftime-trace, or +5ms/0.1% as compared to without this patch)

perf stat -r 5 ./build_rel_base/bin/clang++ -w -S -ftime-trace -mllvm -time-trace-granularity=0 spirit.cpp
...
3634.623793      task-clock (msec)

+273 ms or +~8% (as compared to without -ftime-trace)

perf stat -r 5 ./build_rel_exp/bin/clang++ -w -S -ftime-trace -mllvm -time-trace-granularity=0 spirit.cpp
...
3995.315749      task-clock (msec)

+633 ms or +~19% (as compared to without -ftime-trace, or +360 ms/+10% as compared to without this patch)

Okay, so the price increase in the default config is reasonably small.
It is noticeably larger with a bigger number of events, though.
I don't like how non-thin a wrapper "llvm/Support/JSON.h" is. :/

TLDR: Not horrible, but above what I would consider a reasonable price for abstraction.

llvm/lib/Support/TimeProfiler.cpp
91–94

There is no reserve() function in json::Array, so you end up doing the memory dance (alloc larger, move, delete previous) for *every* entry. :/
I suspect this accounts for the majority of the extra cost, although not all of it.
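For illustration, a minimal sketch of what such a reserve() could look like in llvm/Support/JSON.h, assuming json::Array stores its elements in a std::vector<Value> (hypothetical illustrative code, not part of this patch):

  class Array {
    std::vector<Value> V;
  public:
    // Forward to the underlying vector so callers can pre-allocate capacity
    // and avoid the repeated grow/move/free cycle described above.
    void reserve(size_t S) { V.reserve(S); }
    // ... existing members (push_back, emplace_back, iterators, ...) ...
  };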

97

emplace_back() doesn't work?

This revision is now accepted and ready to land.Apr 15 2019, 5:34 AM
anton-afanasyev marked 4 inline comments as done.Apr 15 2019, 5:47 AM
anton-afanasyev added inline comments.
llvm/lib/Support/TimeProfiler.cpp
91–94

Good point, this should be changed in the json lib.

97

It works (for an explicitly given constructor); I've changed it in a separate review: https://reviews.llvm.org/D60663

anton-afanasyev marked 3 inline comments as done.Apr 15 2019, 5:50 AM
anton-afanasyev added inline comments.
llvm/lib/Support/TimeProfiler.cpp
97

Oops, that's another push_back(), I'm wrong.

anton-afanasyev marked an inline comment as done.

Updated

This revision was automatically updated to reflect the committed changes.
lebedev.ri added inline comments.Apr 16 2019, 10:33 AM
llvm/lib/Support/TimeProfiler.cpp
91–94

If you don't intend to look into that (should be rather trivial),
can you please at least raise a bug about this, with those perf numbers?

anton-afanasyev marked an inline comment as done.Apr 16 2019, 10:45 AM
anton-afanasyev added inline comments.
llvm/lib/Support/TimeProfiler.cpp
91–94

I've measured the perf effect of adding Array.reserve(33000) (changed JSON.h a bit): it gives 3845 msec (vs 3995 msec) for -time-trace-granularity=0. The effect is noticeable, but an actual implementation needs a smart heuristic to determine the number of entries to reserve. For the default option (time-trace-granularity = 500 msec) this doesn't matter.

91–94

What do you mean by bug? Do you mean I should add Array.reserve(N) with N taken heuristically from different runs for different time-trace-granularity? Or would just Array.reserve(4000) be enough for default granularity?

anton-afanasyev marked an inline comment as done.Apr 16 2019, 10:54 AM
anton-afanasyev added inline comments.
llvm/lib/Support/TimeProfiler.cpp
91–94

Also, this result means that the time taken by the json output could be decreased by ~40% (1 - (3845-3634)/(3995-3634)), but this is relevant and significant only for time-trace-granularity=0. The effect of array reservation for the default option value is insignificant.
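(Working through those numbers: the extra cost of the json output at granularity 0 is 3995 - 3634 ≈ 361 msec without the reserve and 3845 - 3634 ≈ 211 msec with it, so reserving removes roughly 1 - 211/361 ≈ 42%, i.e. about 40%, of that extra cost.)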

lebedev.ri added inline comments.Apr 16 2019, 10:54 AM
llvm/lib/Support/TimeProfiler.cpp
91–94

What do you mean by bug?

The lack of json::Array::reserve()

Do you mean I should add Array.reserve(N) with N taken heuristically from different runs for different time-trace-granularity? Or would just Array.reserve(4000) be enough for default granularity?

Hm?
You know how many events the json::Array will end up with: Entries.size() + CountAndTotalPerName.size() + 1.
You can even assert that at the end.
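A rough sketch of that suggestion (hypothetical code, assuming json::Array gains the reserve() discussed above; NumEntries and NumTotals stand in for Entries.size() and CountAndTotalPerName.size() from TimeProfiler.cpp, and the event fields are illustrative):

  #include "llvm/Support/JSON.h"
  #include <cassert>

  llvm::json::Array buildEvents(size_t NumEntries, size_t NumTotals) {
    const size_t Expected = NumEntries + NumTotals + 1;
    llvm::json::Array Events;
    Events.reserve(Expected); // one allocation instead of repeated regrowth
    for (size_t I = 0; I < NumEntries; ++I)
      Events.push_back(llvm::json::Object{{"ph", "X"}, {"name", "section"}});
    for (size_t I = 0; I < NumTotals; ++I)
      Events.push_back(llvm::json::Object{{"ph", "X"}, {"name", "Total section"}});
    // Plus one final metadata event for the process name.
    Events.push_back(llvm::json::Object{{"ph", "M"}, {"name", "process_name"}});
    assert(Events.size() == Expected && "event count is known up front");
    return Events;
  }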

anton-afanasyev marked an inline comment as done.Apr 16 2019, 11:12 AM
anton-afanasyev added inline comments.
llvm/lib/Support/TimeProfiler.cpp
91–94

Oops, you're right! Here is a fix for json::Array: https://reviews.llvm.org/D60788

lebedev.ri added inline comments.Apr 16 2019, 11:25 AM
llvm/lib/Support/TimeProfiler.cpp
91–94

Not 40% though.
It's calculated as abs(old-new)/old = 5.49%

anton-afanasyev marked an inline comment as done.Apr 16 2019, 11:30 AM
anton-afanasyev added inline comments.
llvm/lib/Support/TimeProfiler.cpp
91–94

These are different metrics. I'm talking about the portion of time taken by the Array out of the whole time taken by using the json lib.

lebedev.ri marked an inline comment as done.Apr 16 2019, 11:39 AM
lebedev.ri added inline comments.
llvm/lib/Support/TimeProfiler.cpp
91–94

Ah, right, sorry.

anton-afanasyev marked an inline comment as done.Apr 16 2019, 11:39 AM
anton-afanasyev added inline comments.
llvm/lib/Support/TimeProfiler.cpp
91–94

My metric relates to your statement that the majority or all of the extra cost comes from the memory "dancing" with the Array. Actually it's not the majority, but ~40% -- still a significant number.

lebedev.ri added inline comments.Apr 16 2019, 11:48 AM
llvm/lib/Support/TimeProfiler.cpp
91–94

The problem is that every entry in the std::vector inside json::Array is a json::Value,
json::Value internally is likely a json::Object,
and json::Object is internally an llvm::DenseMap.

So you have an array of DenseMaps... Naturally, there is quite a bit of overhead.
Not all of it comes from dancing around due to the lack of reserve() in json::Array.

Just wanted to note I'm going to look at drafting a lower-level streaming output API so cases like this can avoid materializing all those expensive objects.

That said, I don't claim any special expertise in this area; if someone else wants to jump on it, go ahead.

llvm/lib/Support/TimeProfiler.cpp
91

Sorry, I was vaguely aware of this patch but hadn't looked at the details.
As there are likely to be lots of events, I'd suggest still writing the {"traceEvents":[ / ]} container and the "," separators between events by hand if you care about performance.
You can still use the JSON lib for composing and serializing each individual event.
This is what we do in clangd.

This will reduce peak memory usage, and the repeated allocation/deallocation of similar structures may also make allocation cheaper (I know it does a lot with tcmalloc).

Not sure how much this will help; as always with performance, measuring is a good idea.
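A minimal sketch of that approach (illustrative code, not this patch's or clangd's actual implementation; writeEvent() and the event fields are assumed names): the outer object and the separators are plain text, and the JSON library only serializes one event at a time.

  #include "llvm/ADT/StringRef.h"
  #include "llvm/Support/JSON.h"
  #include "llvm/Support/raw_ostream.h"

  // Serialize a single event with the JSON library; the caller owns the commas.
  static void writeEvent(llvm::raw_ostream &OS, bool &First,
                         llvm::StringRef Name, double StartUs, double DurUs) {
    if (!First)
      OS << ",";
    First = false;
    OS << llvm::json::Value(llvm::json::Object{
        {"pid", 1}, {"tid", 0}, {"ph", "X"},
        {"ts", StartUs}, {"dur", DurUs}, {"name", Name}});
  }

  void writeTrace(llvm::raw_ostream &OS) {
    OS << "{\"traceEvents\":[";   // container written by hand
    bool First = true;
    writeEvent(OS, First, "Frontend", 0.0, 1234.5);
    writeEvent(OS, First, "Backend", 1234.5, 678.9);
    OS << "]}";                   // closed by hand as well
  }

This keeps peak memory proportional to a single event rather than to the whole trace.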

anton-afanasyev marked an inline comment as done.Apr 16 2019, 1:31 PM
anton-afanasyev added inline comments.
llvm/lib/Support/TimeProfiler.cpp
91

Hmm, good point, thanks. I'll try it and measure. My preliminary thought is that reserving the json::Array has already got rid of this extra time cost (though not the peak memory usage).