Page MenuHomePhabricator

[Support] Have ThreadPool initialize a TimeTraceProfiler per thread
AbandonedPublic

Authored by int3 on Jan 29 2022, 8:51 PM.

Details

Reviewers
anton-afanasyev
russell.gallop
mehdi_amini
MaskRay
Group Reviewers
Restricted Project
Summary

This makes the profiler easier to use; we no longer need to
remember to initialize it every time we fork. It also means that we
initialize the profiler at most once per thread instead of once per
task.

TODO: we should probably implement this for Parallel.h as well, but
perhaps only once we have a use case to test it with.

Diff Detail

Unit TestsFailed

TimeTest
60,080 msx64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp
Script: -- : 'RUN: at line 4'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -gline-tables-only -m64 -O0 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/asan/TestCases/scariness_score_test.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/asan/X86_64LinuxConfig/TestCases/Output/scariness_score_test.cpp.tmp
820 msx64 debian > LLVM.CodeGen/AMDGPU::amdpal-callable.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -mtriple=amdgcn--amdpal -verify-machineinstrs < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/amdpal-callable.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck -check-prefixes=GCN,SDAG,GFX8 -enable-var-scope /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/amdpal-callable.ll

Event Timeline

int3 created this revision.Jan 29 2022, 8:51 PM
int3 requested review of this revision.Jan 29 2022, 8:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 29 2022, 8:51 PM

Hi @int3 ,

Thanks for the patch. When I originally added the multi-threaded support for the time profiler I considered adding the initialize/finish thread into the threadpool itself (llvm/lib/Support/ThreadPool.cpp). In the end I thought it was a bit messy there so just added it into LTO.cpp. With the increased use of multi-threading I think it would be worth reconsidering putting this into ThreadPool.cpp so we don't need to add this everywhere we want to trace multithreaded code. I think that the RAII code you propose will help this fit into ThreadPool.cpp neatly. For all I know there may be other problems with doing that, but I think it will be neater now this is used in more places. What do you think about trying that?

Regards
Russ

int3 added a comment.Jan 31 2022, 5:16 AM

Yeah, I think that's probably a good idea. It took me a while to figure out that initializing the profiler on a per-thread basis was required; it was not at all obvious why my profiler events had started disappearing while I was parallelizing lld. Having it done automatically would've saved me some headache.

Another advantage would be that we could actually initialize the profiler once per thread, instead of once per task.

One drawback I guess is that there isn't a good way to specify the ProcName for these threads any more. But perhaps the ProcName was never really that relevant, and the spawned threads could just initialize it with the empty string?

Yeah, I think that's probably a good idea. It took me a while to figure out that initializing the profiler on a per-thread basis was required; it was not at all obvious why my profiler events had started disappearing while I was parallelizing lld. Having it done automatically would've saved me some headache.

Yes, it might help anyone working to further parallelise lld (as long as the scopes are there).

Another advantage would be that we could actually initialize the profiler once per thread, instead of once per task.

That's a good point.

One drawback I guess is that there isn't a good way to specify the ProcName for these threads any more. But perhaps the ProcName was never really that relevant, and the spawned threads could just initialize it with the empty string?

ProcName was originally intended to be clang/lld etc. at the top level, so I'm not sure it's very useful in the way it's used in threads at the moment. This could be empty, or (e.g.) "Thread 1" etc.

int3 updated this revision to Diff 404630.Jan 31 2022, 11:18 AM
int3 retitled this revision from [Support][nfc] Add RAII helper for cleaning up time profiler in thread to [Support] Have ThreadPool initialize a TimeTraceProfiler per thread.
int3 edited the summary of this revision. (Show Details)

alternative approach

alternative approach

Thanks.

I think this now needs different/extra reviewer(s) for the ThreadPool change. Added @mehdi_amini.

ThreadPool changes looks fine overall.

The ThreadPool implementation so far does not have any dependency on global state, it is slightly annoying to me to introduce some here.

llvm/lib/Support/ThreadPool.cpp
57–58

Can you handle this with RAII? llvm::make_scope_exit outside of the loop?

int3 updated this revision to Diff 405086.Feb 1 2022, 1:39 PM
int3 marked an inline comment as done.

scope_exit

int3 planned changes to this revision.Feb 1 2022, 1:40 PM
int3 requested review of this revision.
int3 updated this revision to Diff 406147.Feb 4 2022, 7:53 PM

rebase

int3 added a comment.Feb 7 2022, 9:33 AM

@mehdi_amini is this good to go? I think the test failures are spurious (but not 100% sure)

(There are visible changes around the name associated with the tracer in lld that someone familiar with this in lld should approve.)

llvm/lib/Support/ThreadPool.cpp
37

It seems that we'll always use the instance initialized in the thread that calls "grow". Also, this instance has to be setup before the call to grow, and the thread can't reinitialize it for the lifetime duration of the ThreadPool if I understand correctly.

I'm not sure this makes sense in the full generality of the ThreadPool API?

(There are visible changes around the name associated with the tracer in lld that someone familiar with this in lld should approve.)

Maybe add @MaskRay.

@int3, apologies if my suggestion of adding into ThreadPool has made this more complicated! Your original change may be okay as a quicker fix, moving to ThreadPool could be a follow up.

llvm/lib/Support/ThreadPool.cpp
37

I'm not sure this makes sense in the full generality of the ThreadPool API?

Do we agree that time tracing all threads used by the ThreadPool is desirable and worth pursuing?

mehdi_amini added inline comments.Feb 9 2022, 2:00 AM
llvm/lib/Support/ThreadPool.cpp
37

This could be a useful feature to have a tracing feature for the ThreadPool. I'm not sure about:

  • The expected behavior in terms of threading (with respect to the thread pool creation, the growing of the pool, or the enqueuing of a task).
  • The current TimeTraceProfiler which is non-trivially coupled to some global state, making this all harder to reason about.
int3 added a comment.Feb 9 2022, 5:10 PM

(There are visible changes around the name associated with the tracer in lld that someone familiar with this in lld should approve.)

I think they're pretty safe. But yeah, I will ping the other LLD people about that once we've settled on the core ThreadPool changes.

@int3, apologies if my suggestion of adding into ThreadPool has made this more complicated! Your original change may be okay as a quicker fix, moving to ThreadPool could be a follow up.

No worries & no rush; I'm happy to hash this out :) Was just a bit busy for the last couple of days.

llvm/lib/Support/ThreadPool.cpp
37

It seems that we'll always use the instance initialized in the thread that calls "grow".

Actually we are just copying the value of its TimeTraceGranularity member and using that to initialize a new thread-local instance. I could have just copied the TimeTraceGranularity value itself, but I figured this was a slightly nicer abstraction -- if in the future we add more fields to TimeTraceProfilerInstance, we can keep the same initialization method signature.

You are right that this makes things a little less general though. In particular, there is no way to have different granularities per profiler instance -- every thread must use the same value. IDK if that's an issue... after all, the places where the TimeProfiler is getting used ATM don't take advantage of this flexibility.

the thread can't reinitialize it for the lifetime duration of the ThreadPool if I understand correctly

I don't think there's a use case for reinitializing it...

That said I didn't write the TimeProfiler, so maybe @russell.gallop can confirm.

int3 added a reviewer: MaskRay.Feb 9 2022, 5:11 PM
int3 added inline comments.Feb 14 2022, 11:36 PM
llvm/lib/Support/ThreadPool.cpp
37

bump -- @russell.gallop, could you chime in here?

russell.gallop added inline comments.Feb 15 2022, 7:42 AM
llvm/lib/Support/ThreadPool.cpp
37

You are right that this makes things a little less general though. In particular, there is no way to have different granularities per profiler instance -- every thread must use the same value. IDK if that's an issue... after all, the places where the TimeProfiler is getting used ATM don't take advantage of this flexibility.

Yes, I don't imagine it is very useful to have different granularities.

I don't think there's a use case for reinitializing it...
That said I didn't write the TimeProfiler, so maybe @russell.gallop can confirm.

I'm not sure I follow the use of the Thread Pool where this would be required...

I think @anton-afanasyev added the time profiler, I extended for LLD tracing, but only really with the ThinLTO threading case in mind. I imagine that the profiler could be re-engineered to meet what the thread pool can do, but I'm not really aware of what the "full generality of the thread pool" is so can't really say more than that.

I don't think I'll have time to do this myself. Let me know if you don't have time and I can ask if there is someone around here who can take a look at this.

mehdi_amini added inline comments.Feb 15 2022, 10:16 AM
llvm/lib/Support/ThreadPool.cpp
37

The ThreadPool isn't a "global" thing, but the profiler is. It seems to me that there is a mismatch in terms of lifetime / lifecycle that creeps up here and does not make it a natural fit for the ThreadPool to know directly about the profiler.

For example a MLIRContext (similar to LLVMContext) owns a ThreadPool, you may through the lifetime of a process create and destroy multiple ThreadPool.

The thread which creates the ThreadPool isn't necessarily the one that will enqueue work to the pool. In MLIR for example, the pool will get work enqueued when a pass manager is executed: what does it mean in terms of profiler instance?

The kind of things that can be surprising with the implementation in this patch is that a sequence like:

  • Create a ThreadPool
  • Setup profiler
  • Grow the pool to more thread
  • schedule

would lead to only some threads having the profiler enabled (the ones that grew post setup).
More simpler "gotcha" is:

  • Create a ThreadPool (indirectly, from the user point of it is "initialize the compiler" or "Create a MLIRContext")
  • Setup profiler
  • run

this wouldn't get any thread other that the current one having profiler setup.
However many construct like parallel_for will schedule iterations in the pool but also use the current thread to run work, so work executed in the current thread would get different behavior from the work setup in the pool.
Variant of this behavior can include:

  • Setup profiler
  • Create a ThreadPool (indirectly, from the user point of it is "initialize the compiler" or "Create a MLIRContext")
  • parallel_for from another thread

This time the parallel iteration in the thread pool would have the profiler setup, but not the ones directly executed in the calling thread.

int3 planned changes to this revision.Feb 18 2022, 8:46 AM
int3 added inline comments.
llvm/lib/Support/ThreadPool.cpp
37

got it, I see why you are uncomfortable with the change now. I will revert to the earlier non-ThreadPool-invasive version of this diff. Thanks for the insight!

(Just wanted to say I know ld.lld --time-trace has problems. For example if I run

ld.lld @response.txt --threads=8 --time-trace -o clang
jq -r '.traceEvents[] | select((.name|contains("Total")) or (.name|contains("Write"))) | "\(.dur/1000000) \(.name) \(.args)"' < clang.time-trace

I can sometimes find missing trace events. But I haven't had time to look into this. Sorry!)

int3 abandoned this revision.Mar 16 2022, 2:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 16 2022, 2:32 PM