This is an archive of the discontinued LLVM Phabricator instance.

[llvm] [Debuginfod] DebuginfodCollection and DebuginfodServer for tracking local debuginfo.
ClosedPublic

Authored by noajshu on Nov 30 2021, 10:12 PM.

Details

Summary

This library implements the class DebuginfodCollection, which scans a set of directories for binaries, classifying them according to whether they contain debuginfo. This also provides the DebuginfodServer, an HTTPServer which serves debuginfod's /debuginfo and /executable endpoints. This is intended as the final new supporting library required for llvm-debuginfod.

As implemented here, DebuginfodCollection only finds ELF binaries and DWARF debuginfo. All other files are ignored. However, the class interface is format-agnostic. Generalizing to support other platforms will require refactoring LLVM's object parsing libraries to eliminate uses of report_fatal_error (e.g. when reading WASM files), so that the debuginfod daemon does not crash when it encounters a malformed file on disk.

The DebuginfodCollection is tested by end-to-end tests of the debuginfod server (D114846).

Diff Detail

Event Timeline

noajshu created this revision.Nov 30 2021, 10:12 PM
noajshu requested review of this revision.Nov 30 2021, 10:12 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 30 2021, 10:12 PM
noajshu edited the summary of this revision. (Show Details)Nov 30 2021, 10:20 PM
noajshu edited the summary of this revision. (Show Details)Nov 30 2021, 10:22 PM
noajshu edited the summary of this revision. (Show Details)
phosek added a subscriber: phosek.Dec 2 2021, 12:01 PM
phosek added inline comments.
llvm/lib/Debuginfod/Debuginfod.cpp
249

I expect this loop to be the performance bottleneck, so I think we should consider using a ThreadPool here as well to process files in parallel. That's the strategy used by elfutils' debuginfod as well.

This would likely require rethinking the API to allow sharing the thread pool between the file processing and request handling in D114846.
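
For illustration, fanning the scan loop out onto llvm::ThreadPool might look roughly like this (the names below are placeholders, not code from the patch):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"
#include <string>

// Placeholder for the per-file work (classify the file, record debuginfo).
void handleFile(const std::string &Path);

void scanFilesInParallel(llvm::ArrayRef<std::string> Paths) {
  llvm::ThreadPool Pool(llvm::hardware_concurrency());
  for (const std::string &Path : Paths)
    Pool.async([Path] { handleFile(Path); }); // Each file runs on a worker.
  Pool.wait(); // Block until every queued file has been processed.
}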

noajshu edited the summary of this revision. (Show Details)Dec 15 2021, 7:25 PM
noajshu updated this revision to Diff 394931.Dec 16 2021, 11:14 AM
noajshu edited the summary of this revision. (Show Details)

Add DebuginfodServer class. Add multithreaded directory scanning (unstable).

noajshu retitled this revision from [llvm] [DebugInfo] DebuginfodCollection for tracking local debuginfo. (WIP) to [llvm] [DebugInfo] DebuginfodCollection and DebuginfodServer for tracking local debuginfo. (WIP).Dec 16 2021, 11:18 AM
noajshu edited the summary of this revision. (Show Details)
noajshu updated this revision to Diff 395000.Dec 16 2021, 2:36 PM

Add federation to other debuginfod servers.

llvm/lib/Debuginfod/Debuginfod.cpp
249

Good idea!
I've added this, but I find the behavior unstable when the concurrency is >1 and a large directory is scanned.
I've searched for the bug in my code with no luck. I'm wondering whether some of the supporting libraries I'm using are not thread-safe or are leaking memory, but it's more likely I'm missing something obvious. Anyway, I'm uploading this as it's WIP; maybe there is a better task queue architecture you could recommend.
PS: the TaskQueue in LLVM is not appropriate here, as it is for a serialized sequence of tasks.

noajshu updated this revision to Diff 403095.Jan 25 2022, 6:00 PM
noajshu retitled this revision from [llvm] [DebugInfo] DebuginfodCollection and DebuginfodServer for tracking local debuginfo. (WIP) to [llvm] [DebugInfo] DebuginfodCollection and DebuginfodServer for tracking local debuginfo..

Fix concurrency bug and rebase against main.

noajshu marked an inline comment as done.Jan 25 2022, 6:03 PM
noajshu added inline comments.
llvm/lib/Debuginfod/Debuginfod.cpp
249

The concurrency bug has been fixed.

noajshu updated this revision to Diff 406295.Feb 6 2022, 2:54 PM
noajshu edited the summary of this revision. (Show Details)
noajshu added reviewers: dblaikie, phosek.
noajshu retitled this revision from [llvm] [DebugInfo] DebuginfodCollection and DebuginfodServer for tracking local debuginfo. to [llvm] [Debuginfod] DebuginfodCollection and DebuginfodServer for tracking local debuginfo..Feb 6 2022, 3:02 PM
fche2 added a subscriber: fche2.Feb 22 2022, 12:30 PM
fche2 added inline comments.
llvm/lib/Debuginfod/Debuginfod.cpp
336

Don't you need a mutex-guard on this operation?

351

Ditto re. mutex-guard?

noajshu added inline comments.Feb 22 2022, 1:36 PM
llvm/lib/Debuginfod/Debuginfod.cpp
336

Thanks so much for catching this!
I propose to switch to RW-Mutex so the readers can read without blocking each other.

noajshu updated this revision to Diff 411708.Feb 27 2022, 4:44 PM

Use RWMutex to protect readers and writers of the debug binaries and binaries collections.

noajshu marked 2 inline comments as done.Feb 27 2022, 4:46 PM
mysterymath added inline comments.Mar 1 2022, 5:07 PM
llvm/include/llvm/Debuginfod/Debuginfod.h
82

I don't think the = 1 does anything, since providing a constructor inhibits generation of a default one.

93

Since all members are public, this should be a struct. This also fits with how it's used, post-construction.

llvm/lib/Debuginfod/Debuginfod.cpp
228–229

It seems a bit odd to write this to stdout unconditionally; is this necessary?

Otherwise, the whole routine just becomes

while (true) {
  if (Error Err = update())
    return Err;
  std::this_thread::sleep_for(Interval);
}
return Error::success();
236

llvm_unreachable()?

326–330

Is there an advantage to manually managing the concurrency here, over passing it as an argument to ThreadPool, std::min-ed with the hardware concurrency?
From a cursory look at ThreadPool's API, each async call after the thread pool is full should more or less just push a std::function<void()> onto a vector.

Herald added a project: Restricted Project. · View Herald TranscriptMar 1 2022, 5:07 PM
noajshu added inline comments.Mar 2 2022, 7:21 AM
llvm/lib/Debuginfod/Debuginfod.cpp
228–229

Thanks, I agree this is awkward. The first update is logged to stdout so that the lit tests in D114846 can tell when it's safe to query the server for debuginfo. If the test client pinged the server before it had found the binary, the server would return a 404 and the test would flake.

An alternative is to add a time delay before the client pings the server, but this is discouraged because of timing variability across systems.

We could hide this output except when "verbose" logging is enabled, or similar. Would this be good?

fche2 added inline comments.Mar 2 2022, 7:46 AM
llvm/lib/Debuginfod/Debuginfod.cpp
228–229

(For reference, the elfutils debuginfod exports prometheus metrics about the progress of its operations, so that an external testsuite process can synchronize.)

mysterymath added inline comments.Mar 2 2022, 11:05 AM
llvm/lib/Debuginfod/Debuginfod.cpp
228–229

The first update is logged to stdout so that the lit tests in D114846 can tell when it's safe to query the server for debuginfo. If the test client pinged the server before it had found the binary, the server would return a 404 and the test would flake.

Ah, this sounds like it'd be something generally useful then; it's not uncommon for servers to print once they're ready to serve.

Maybe we could make the startup process more explicit: initialize the cache, then bring up the HTTP server, print something like "Serving on ...", and then set up the continuous updates.
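
As a rough sketch of that ordering (the callables and names are stand-ins, not the actual llvm-debuginfod interfaces):

#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// 1. Scan once so the first requests can be answered, 2. start serving,
// 3. announce readiness so tests can synchronize, 4. keep rescanning.
void runDaemon(const std::function<bool()> &UpdateCollection,
               const std::function<bool()> &StartServer,
               std::chrono::milliseconds Interval) {
  if (!UpdateCollection() || !StartServer())
    return;
  std::cout << "Serving debuginfo requests\n"; // Tests wait for this line.
  for (;;) {
    std::this_thread::sleep_for(Interval);
    if (!UpdateCollection())
      return;
  }
}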

304–306

I poked around, and it doesn't look like dbgs() offers any thread-safety guarantees. If that's the case, these async jobs may have oddly interleaved logs at best, and undefined behavior at worst.

noajshu updated this revision to Diff 412791.Mar 3 2022, 11:43 AM
noajshu marked 4 inline comments as done.

Replace inserts to dbgs() with a synchronized logging class.

noajshu marked an inline comment as done and an inline comment as not done.Mar 3 2022, 11:52 AM

Thanks @mysterymath and @fche2 for many very helpful comments!
@mysterymath suggested we could perform the first update manually, then print a message like "ready to accept connections". This way the test client knows when it can ask for artifacts. This seems logical so I will make this change in D114846.

Regarding logging:
@mysterymath pointed out that logging by inserting strings to dbgs() is thread-unsafe. @fche2 pointed out that elfutils' debuginfod exports Prometheus metrics. I would advocate for keeping some logging facility in the application if only because it is helpful for testing and debugging the code. I am not aware of an existing logging framework in LLVM, so I have created a simple std::queue-based logging class DebuginfodLog using a sys::Mutex to synchronize access. In the future this could be upgraded to support Prometheus exports or other features. Please let me know if you think this will suffice, or if you would prefer an alternative solution to the logging problem. Thanks a lot!
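
A minimal sketch of that kind of mutex-guarded queue (simplified; the DebuginfodLog in this patch differs in its details):

#include "llvm/Support/Mutex.h"
#include <mutex>
#include <queue>
#include <string>
#include <utility>

class QueueLogger {
  llvm::sys::Mutex QueueMutex;
  std::queue<std::string> Entries;

public:
  // Safe to call from any worker thread.
  void push(std::string Message) {
    std::lock_guard<llvm::sys::Mutex> Guard(QueueMutex);
    Entries.push(std::move(Message));
  }
  // Returns false when nothing is queued; the caller can retry or block.
  bool pop(std::string &Out) {
    std::lock_guard<llvm::sys::Mutex> Guard(QueueMutex);
    if (Entries.empty())
      return false;
    Out = std::move(Entries.front());
    Entries.pop();
    return true;
  }
};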

llvm/lib/Debuginfod/Debuginfod.cpp
326–330

As this is within a directory iterator loop, my concern was for when there is a large number of files within that directory. If we add tasks to the ThreadPool faster than they are completed, the memory usage of that vector of std::function<void()>s becomes unbounded. So I thought it best to manage the progress through the loop more manually. What do you think?

mysterymath added inline comments.Mar 3 2022, 12:20 PM
llvm/lib/Debuginfod/Debuginfod.cpp
326–330

I'm unsure whether or not the buildup would ever cause problems in practice; then again, I just finished cleaning up an unbounded memory usage problem that broke in production. I'll defer that determination to others with more experience.

Assuming we keep the semantics here, it seems like what we'd really want is a version of ThreadPool that blocks submission of additional requests if the thread pool is full. This would provide feedback to stop the iterator from producing additional entries that cannot yet be handled (and would thus need to be stored). This seems like a useful abstraction in its own right, and there's definitely prior art.

I'd suggest either wrapping ThreadPool to provide such an API (simplified for the purposes of this file) or adding it as an option to ThreadPool. The first would be a slight modification to the code you have, abstracting out the management of NumTasksRemaining. It'd also probably be more elegant to have the blocking async call(s) wait on a condition variable, so that the next task that finishes can signal that a thread is now available, rather than busy-waiting with sleep().

noajshu added inline comments.Mar 28 2022, 3:23 PM
llvm/lib/Debuginfod/Debuginfod.cpp
326–330

Thanks for these suggestions! I agree on all points.

It's only a small change to ThreadPool to let us wait for room in the queue with a condition variable:

bool queueEmptyUnlocked() { return Tasks.empty(); }

void ThreadPool::waitQueue() {
  // Wait for the queue to be empty
  std::unique_lock<std::mutex> LockGuard(QueueLock);
  CompletionCondition.wait(LockGuard, [&] { return queueEmptyUnlocked(); });
}

I also collected some data to find out when the unbounded memory usage of the queue of jobs could actually be a problem in production. From my measurements, each job in the pool's queue consumes approximately N + 320 bytes of memory, where N is the number of bytes in the file path. For the right system setup, this could indeed consume lots of memory. Briefly, filesystem caching could allow millions of file paths to be traversed in seconds, but actually reading those files could take longer, causing a queue buildup.

However, if these files are ELF binaries, they will end up in our StringMap in memory anyway with the current implementation. So the unbounded memory usage will be a problem for this user regardless, if their files are mostly ELF binaries.

Therefore, only a user with millions of non-ELF files mixed in with a smaller number of ELF binaries could meaningfully benefit from waiting here before submitting more jobs to the queue. For example, this describes a developer like myself, with limited local memory and millions of files that are not ELF binaries. When I plug in the numbers for my own filesystem, I could fit one job for each file in my development directory with about 0.4 GB of total memory usage. This is a comfortable margin on my own system, but I'm unsure about other users. So if it seems reasonable, I will just make the small tweaks to the ThreadPool API to be safe.
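
As a purely illustrative calculation (the file count and average path length are assumptions, not measurements): a million queued paths averaging 80 bytes each would occupy roughly 1,000,000 × (80 + 320) bytes, or about 0.4 GB, in line with the figure above.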

noajshu updated this revision to Diff 419584.Mar 31 2022, 4:12 PM
noajshu marked 2 inline comments as done.

Update ThreadPool API to allow waiting until the queue is empty, allowing Debuginfod to avoid unbounded queue size without manual concurrency management.

noajshu marked 2 inline comments as done.Mar 31 2022, 4:13 PM
mysterymath added inline comments.Mar 31 2022, 4:44 PM
llvm/lib/Debuginfod/Debuginfod.cpp
335

If I'm reading this right, wouldn't this loop dispatch one item to the queue, wait for the queue to be empty, dispatch another item, wait for the queue to be empty, and so on? It seems like this disables parallelism entirely.

I'd have expected this to wait until the queue was "not full"; then items would be dispatched until max concurrency was reached, and the next item dispatched the moment a thread becomes free.

noajshu added inline comments.Mar 31 2022, 5:32 PM
llvm/lib/Debuginfod/Debuginfod.cpp
335

I don't think this should disable parallelism entirely as the queue will become empty as soon as the job starts processing in a worker thread, rather than when that job finishes. Let me double check this.

noajshu added inline comments.Mar 31 2022, 6:21 PM
llvm/lib/Debuginfod/Debuginfod.cpp
335

On a closer look, you're right about it disabling parallelism. This is due to ThreadPool's internals! Further modification of the ThreadPool is required.

Also I think it would be simple enough to parametrize as waitQueueSize(size_t Size), blocking until the queue has at most Size tasks.
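
A rough sketch of that parametrized wait, reusing the member names from the earlier snippet (the version eventually proposed in D126815 differs):

void ThreadPool::waitQueueSize(size_t MaxQueueSize) {
  // Block until at most MaxQueueSize tasks remain in the queue. This assumes
  // workers notify CompletionCondition whenever they take a task off the
  // queue, not only when a task finishes.
  std::unique_lock<std::mutex> LockGuard(QueueLock);
  CompletionCondition.wait(LockGuard,
                           [&] { return Tasks.size() <= MaxQueueSize; });
}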

noajshu updated this revision to Diff 419602.Mar 31 2022, 7:17 PM

Fix incorrect waiting for queue size in ThreadPool.

noajshu updated this revision to Diff 419606.Mar 31 2022, 7:41 PM

Simplify implementation of ThreadPool::waitQueueSize

noajshu added inline comments.Mar 31 2022, 7:43 PM
llvm/lib/Debuginfod/Debuginfod.cpp
335

I corrected the implementation of waitQueueSize. Thank you for catching this!

Although it takes a Size parameter, I leave it as the default of 0 here.

noajshu marked an inline comment as done.May 2 2022, 7:36 PM
noajshu updated this revision to Diff 426567.May 2 2022, 8:30 PM

Update comments in Debuginfod.cpp and remove the declarations of getCachedOrDownloadSource and getCachedOrDownloadExecutable from Debuginfod.h, as they are only used internally by the debuginfod client.

Thank you to all of the reviewers for your extremely helpful comments!

I believe there are no unresolved comments at this point, however I have two questions:

First, I wonder if it wouldn't hurt to add a /source endpoint, which doesn't check the local DebuginfodCollection at all but simply skips straight to using the client.
This way, users who run a local debuginfod server would only have to set all their known public servers in the DEBUGINFOD_URLS variable once when they start llvm-debuginfod. When they use a client tool (like llvm-symbolizer) it would suffice to point to their local server only, as all requests would be federated.

Second, I was wondering if there is any desire to split this out into two or more diffs. For example, although the changes to ThreadPool and Symbolize are quite small I'm happy to separate them out if desired.

Thank you!

noajshu updated this revision to Diff 432195.May 25 2022, 10:46 PM
noajshu edited the summary of this revision. (Show Details)

Add updateIfStale to safely update Collection if it is stale (possibly before a periodic update), and add locking to update() to prevent races with other callers.

It looks like something went a bit screwy with the last patch; the diff from that patch to the previous one deletes a bunch of code, including waitQueueSize, which is still called. Would you try re-uploading the latest changes?

First, I wonder if it wouldn't hurt to add a /source endpoint, which doesn't check the local DebuginfodCollection at all but simply skips straight to using the client.
This way, users who run a local debuginfod server would only have to set all their known public servers in the DEBUGINFOD_URLS variable once when they start llvm-debuginfod. When they use a client tool (like llvm-symbolizer) it would suffice to point to their local server only, as all requests would be federated.

Seems to me like we'd eventually want the source endpoint to behave like the other endpoints: first try looking things up locally, then federate. Both behaviors would be naturally supported when we get around to implementing the source endpoint. The behavior you describe would be the one you'd get if you provide an empty list of paths to scan.

Second, I was wondering if there is any desire to split this out into two or more diffs. For example, although the changes to ThreadPool and Symbolize are quite small I'm happy to separate them out if desired.

IMO, the symbolize change seems safe enough, but ThreadPool is a pretty foundational library. It seems wise to separate this out and loop in the folks who've tended to touch the thread pool; they may have more specific opinions.

noajshu updated this revision to Diff 432668.May 27 2022, 4:31 PM

Add back accidentally-deleted changes (thanks @mysterymath!)

Seems to me like we'd eventually want the source endpoint to behave like the other endpoints: first try looking things up locally, then federate. Both behaviors would be naturally supported when we get around to implementing the source endpoint.

I agree. In this case I won't add it here, but leave this for future revisions.

IMO, the symbolize change seems safe enough, but ThreadPool is a pretty foundational library. It seems wise to separate this out and loop in the folks who've tended to touch the thread pool; they may have more specific opinions.

Ok, I will split out those changes separately!

mysterymath added inline comments.May 31 2022, 2:52 PM
llvm/include/llvm/Debuginfod/Debuginfod.h
88

Since the Message is commonly formed by concatenation, making this take const Twine& Message instead should save some heap allocations when logging.

93

From how this is used, it looks like it'd be simpler to make pop() blocking and non-Optional.
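
A sketch combining both suggestions (illustrative only; the class that landed may differ): push takes a Twine so concatenated messages are materialized once, and pop blocks until an entry is available:

#include "llvm/ADT/Twine.h"
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

class BlockingLog {
  std::mutex M;
  std::condition_variable NotEmpty;
  std::queue<std::string> Entries;

public:
  void push(const llvm::Twine &Message) {
    {
      std::lock_guard<std::mutex> Guard(M);
      Entries.push(Message.str()); // Materialize the Twine exactly once.
    }
    NotEmpty.notify_one();
  }
  std::string pop() {
    std::unique_lock<std::mutex> Guard(M);
    NotEmpty.wait(Guard, [&] { return !Entries.empty(); }); // Block if empty.
    std::string Out = std::move(Entries.front());
    Entries.pop();
    return Out;
  }
};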

llvm/lib/Debuginfod/Debuginfod.cpp
242

Prefer StringRef over auto, since the type is obvious and concrete.

249

const std::string& over auto; I think as written this will even copy the string.

305

EC, from the LLVM variable naming convention

307

I and E, from the LLVM variable naming convention

341

std::lock_guard should work with RWMutexes for exclusive locking, and it's easier to read than doing it manually.
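
For illustration, the RAII style being suggested (names are placeholders):

#include "llvm/Support/RWMutex.h"
#include <mutex>
#include <shared_mutex>

llvm::sys::RWMutex CollectionMutex;

void writeEntry() {
  // Exclusive lock for writers; released automatically at end of scope.
  std::lock_guard<llvm::sys::RWMutex> Guard(CollectionMutex);
  // ... mutate the collection ...
}

void readEntry() {
  // Shared lock for readers; concurrent readers do not block each other.
  std::shared_lock<llvm::sys::RWMutex> Guard(CollectionMutex);
  // ... read from the collection ...
}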

356
364

nit: remove empty line

385

nit: remove empty line

406
412–413

This path will usually cause a "collection was not stale" error message if the binary was not found, which isn't as good as just "binary not found." It's also surprising that updateIfStale() will throw an error if the collection was not stale; usually methods with that naming convention do nothing in that case, since the name suggests that it's not an error to violate the precondition.

414
424
429
435
443
453
477

This return doesn't do anything; remove.

noajshu updated this revision to Diff 433518.Jun 1 2022, 1:22 PM

Remove changes to ThreadPool, which were moved to D126815.

noajshu updated this revision to Diff 433544.Jun 1 2022, 2:13 PM

Replace manual locking of RWMutex with RAII locks throughout, and incorporate other suggested changes.

noajshu marked 18 inline comments as done.Jun 1 2022, 2:18 PM
noajshu added inline comments.
llvm/lib/Debuginfod/Debuginfod.cpp
341

Thanks! Since we're using an RWMutex here, writers can take an exclusive RAII lock and readers can take std::shared_lock<RWMutex> so they don't block each other.

412–413

Good point, how about this: we return an Expected<bool> which, if there are no errors during update(), indicates whether the collection got updated. If it's not stale, we don't bother checking for the path again.
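
A sketch of that shape (isStale() and the surrounding members are placeholders, not necessarily the final code):

#include "llvm/Support/Error.h"

// Returns true if the collection was refreshed, false if it was already
// fresh, or an Error if the refresh itself failed.
llvm::Expected<bool> DebuginfodCollection::updateIfStale() {
  if (!isStale())
    return false;
  if (llvm::Error Err = update())
    return std::move(Err);
  return true;
}

// Caller side: only repeat the lookup when an update actually happened.
//   Expected<bool> Updated = Collection.updateIfStale();
//   if (!Updated)
//     return Updated.takeError();
//   if (*Updated)
//     /* look up the build ID again */;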

noajshu marked an inline comment as done.Jun 1 2022, 2:20 PM
noajshu added inline comments.
llvm/lib/Debuginfod/Debuginfod.cpp
341

Good idea, I switched everything over to the std:: RAII locks and it's much cleaner.

mysterymath accepted this revision.Jun 1 2022, 2:25 PM

LGTM, but please wait until at least sometime next week to submit, so there's space for any last-chance comments on this one.

This revision is now accepted and ready to land.Jun 1 2022, 2:25 PM
noajshu updated this revision to Diff 433586.Jun 1 2022, 4:22 PM

Add back declaration of getDefaultDebuginfodUrls, as it is used by llvm-symbolizer.

noajshu updated this revision to Diff 433646.Jun 1 2022, 9:13 PM

Correct usage of Timer

noajshu updated this revision to Diff 437761.Jun 16 2022, 5:55 PM

Per the discussion on D126815, we switch the concurrency model to one in which workers directly advance the shared (lock-protected) directory iterator.

We add a condition variable to notify the main update thread when the iteration is complete and all workers have returned. We have removed the dependency on waitQueueSize so that we can proceed without D126815.
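
The shape of that model, as an illustrative sketch (identifiers are placeholders rather than the patch's exact code):

#include "llvm/Support/FileSystem.h"
#include <mutex>
#include <string>
#include <system_error>

void handleFile(const std::string &Path); // Placeholder for per-file work.

// Each worker runs this loop: claim the next path under the mutex, advance
// the shared iterator, then do the heavy work outside the lock.
void workerLoop(std::mutex &IteratorMutex,
                llvm::sys::fs::recursive_directory_iterator &It) {
  llvm::sys::fs::recursive_directory_iterator End;
  while (true) {
    std::string Path;
    {
      std::lock_guard<std::mutex> Guard(IteratorMutex);
      if (It == End)
        return; // Iteration finished; this worker exits.
      Path = It->path();
      std::error_code EC;
      It.increment(EC);
      if (EC)
        return;
    }
    handleFile(Path); // File processing happens without holding the lock.
  }
}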

mysterymath added inline comments.Jun 17 2022, 10:01 AM
llvm/lib/Debuginfod/Debuginfod.cpp
371–377

Since each async call exits as soon as I == E || EC, this block can be replaced with just ThreadPool.wait(), and there's no need to maintain a separate condition variable. (ThreadPool maintains one internally for this purpose.)
It also looks like the decrements of NumActiveWorkers and the outer loop are interleaved in a way that adds some complexity to the situation; this would remove the need for NumActiveWorkers, which clears that up too.

noajshu updated this revision to Diff 438821.Jun 21 2022, 1:40 PM

Rebase against main and replace NumActiveThreads counting with use of ThreadPoolTaskGroup.

noajshu marked an inline comment as done.Jun 21 2022, 1:40 PM
noajshu added inline comments.
llvm/lib/Debuginfod/Debuginfod.cpp
371–377

Great idea!

Since the first draft of this revision, ThreadPoolTaskGroup was merged into LLVM. It nicely fits our use case; here we just call IteratorGroup.wait().

(Now that I am aware of the new task group feature, I propose to close D126815 altogether, as it appears quite ad hoc in light of the clean alternative of using a task group.)
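
For reference, the task-group shape being described, as a sketch with illustrative names:

#include "llvm/Support/ThreadPool.h"

void workerLoop(); // Placeholder: e.g. the shared-iterator scan loop.

// Launch one worker per pool thread inside a group, then wait on just that
// group; ThreadPoolTaskGroup::wait() returns once all of its tasks finish.
void scanWithTaskGroup(llvm::ThreadPool &Pool) {
  llvm::ThreadPoolTaskGroup IteratorGroup(Pool);
  for (unsigned I = 0, E = Pool.getThreadCount(); I != E; ++I)
    Pool.async(IteratorGroup, workerLoop);
  IteratorGroup.wait();
}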

This revision was landed with ongoing or failed builds.Jul 6 2022, 1:02 PM
This revision was automatically updated to reflect the committed changes.
noajshu marked an inline comment as done.

Thanks, I'm trying to reproduce this with a local build to see what the problem is.

I have a fix, pushing now.

It's been 6 hours since https://lab.llvm.org/buildbot/#/builders/57/builds/19588 started failing with

FAILED: lib/libLLVMDebuginfod.so.15git

It's been 6 hours since https://lab.llvm.org/buildbot/#/builders/57/builds/19588 started failing with

FAILED: lib/libLLVMDebuginfod.so.15git

Sorry about this, I'll fix this right away.

It's been 6 hours since https://lab.llvm.org/buildbot/#/builders/57/builds/19588 started failing with

FAILED: lib/libLLVMDebuginfod.so.15git

Fixed by 819a7f98cd6d.