Page MenuHomePhabricator

[support] Allow TimeProfiler tracing across threads
Needs ReviewPublic

Authored by mbs-modular on Aug 31 2022, 8:07 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

*DRAFT*

The change to high_precision_clock to gain cross-thread consistency has not been tested on all platforms. Also need to measure slowdown on all platforms. Think we're going to use this change internally for a while before attempting to push upstream.


This change allows a TimeProfiler profiling entry to be created
on one thread and finished on another. This was triggered by a use
case where we needed to trace the scheduling delay for work items
ultimately handled by independent worker threads.

The clock is changed to std::chrono::high_resolution_clock, both for
its increased precision (our work items can be in the us range), but
also for its cross-thread consistency (all threads see the same clock).

As a pleasant side effect it is now possible to pay for the cost of
constructing a profiling entry with a possibly expensive detail string
outside of the critical section we wish to trace. I think we could do
better with shifting the cost of constructing detail strings to the final
trace writing phase, given the API already supports providing detail
strings via a callback. But I've not audited the code to check if
detail closures could have their lifetimes safely extended so dramatically.

Could not find unit tests, please let me know if I missed them.

Diff Detail

Event Timeline

mbs-modular created this revision.Aug 31 2022, 8:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2022, 8:07 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
mbs-modular requested review of this revision.Aug 31 2022, 8:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2022, 8:07 PM
lattner added a subscriber: lattner.

Adding a few reviewers who have worked on this, Russell Gallop has worked on something directly related.

russell.gallop added a subscriber: mehdi_amini.EditedSep 1 2022, 7:13 AM

Hi @mbs-modular ,

This looks interesting. Does this help with the case @mehdi_amini outlined here: https://reviews.llvm.org/D118550/new/#3323665 ? It would be great to be able to get the TimeProfile to trace anything multi-threaded in LLVM (anything using Thread Pool), rather than having to hook it into each use.

Regarding the patch, it looks like there are a number of non-functional changes here (e.g. move and rename Entry class, making Detail optional for TimeTraceScope, some style and formatting and a few others). Would it be possible to split those out to to a separate NFC patch so there is a patch focussed on the new functionality? I'm also trying to think whether the change to high_resolution_clock might have side-effects so could be a separate patch. I feel like it's the kind of thing that might trip up on one platform or another.

Could not find unit tests, please let me know if I missed them.

I'm not aware of any unit tests (though they sound like a good idea :-) ). There are some lit tests in clang/test/Driver and lld/test.

Likely unrelated but there is an outstanding bug on time-trace (https://github.com/llvm/llvm-project/issues/56554). Just noting it here in case anything here either fixes or desensitizes it.

Regards
Russ

Thanks @russell.gallop.

Does this help with the case @mehdi_amini outlined here: https://reviews.llvm.org/D118550/new/#3323665 ?

If I'm reading correctly that is lamenting the need for each thread to establish and tear down it's own context as a consequence of relying on the TLS profiling context. That's not a concern for us, and the TLS approach seems inevitable given the current push/pop API. A global lock-free ring for all events is probably a better approach, but that's a new impl entirely I think.

Our use case is pretty simple:

// thread A
auto entry = timeTraceProfilerBeginEntry("e")
enqueuWorkItem([entry = std::move(entry)]() {
    // thread B
    timeTraceProfilerEndEntry(std::move(entry));
    ... do the work ...
})

The entries for "e" now measure scheduling latency.

Would it be possible to split those out to to a separate NFC patch

Sure thing.

I'm also trying to think whether the change to high_resolution_clock might have side-effects

Likewise a bit troubled by that. I think we should pessimistically assume there's a memory barrier in the high res clock, so must make sure to never ask for now() from it if tracing is disabled. I think I need an extra if (TimeTraceProfilerInstance) test in timeTraceProfilerBeginEntry, and will check elsewhere.

there is an outstanding bug on time-trace (https://github.com/llvm/llvm-project/issues/56554).

Yes, appears unrelated. I'm avoiding any caller changes.

NFC (mostly!) changes in https://reviews.llvm.org/D133153, will bounce back here after that.

[support] Allow TimeProfiler tracing across threads

This change allows a TimeProfiler profiling entry to be created on one thread and finished on another. This
was triggered by a use case where we needed to trace the scheduling delay for work items ultimately handled by
independent worker threads. The clock is changed to std::chrono::high_resolution_clock, both for its
increased precision (our work items can be in the us range), but also for its cross-thread consistency (all
threads see the same clock).

As a pleasant side effect it is now possible to pay for the cost of constructing a profiling entry with a
possibly expensive detail string outside of the critical section we wish to trace. I think we could do better
with shifting the cost of constructing detail strings to the final trace writing phase, given the API already
supports providing detail strings via a callback. But I've not audited the code to check if detail closures
could have their lifetimes safely extended so dramatically.

mbs-modular edited the summary of this revision. (Show Details)
mbs-modular edited the summary of this revision. (Show Details)Sep 21 2022, 11:56 AM