This is an archive of the discontinued LLVM Phabricator instance.

[Debuginfod] Don't depend on Content-Length.
ClosedPublic

Authored by mysterymath on Mar 15 2022, 10:45 AM.

Details

Summary

The present implementation of debuginfod lookups requires the
Content-Length field to be populated in the HTTP server response.
Unfortunately, Content-Length is optional, and there are some real
scenarios where it's missing. (For example, a Google Cloud Storage
server doing on-the-fly gunzipping.)

This changes the debuginfod response handler to directly stream the
output to the cache file as it is received. In addition to allowing
lookups to proceed without a Content-Lenght, it seems somewhat more
straightforward to implement, and it allows the disk I/O to be
interleaved with the network I/O.

Diff Detail

Event Timeline

mysterymath created this revision.Mar 15 2022, 10:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 15 2022, 10:45 AM
mysterymath requested review of this revision.Mar 15 2022, 10:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 15 2022, 10:45 AM
noajshu added inline comments.Mar 15 2022, 12:51 PM
llvm/include/llvm/Debuginfod/HTTPClient.h
9–10
noajshu added inline comments.Mar 15 2022, 1:01 PM
llvm/lib/Debuginfod/Debuginfod.cpp
122

Just out of curiousity, why use final? Is this a style convention or is there some performance impact?

Update header comments; remove final.

mysterymath marked an inline comment as done.Mar 15 2022, 1:19 PM
mysterymath added inline comments.
llvm/include/llvm/Debuginfod/HTTPClient.h
9–10

Made these comments a bit more abstract.

llvm/lib/Debuginfod/Debuginfod.cpp
122

No reason, just a copy/pasted oversight from BufferedHTTPResponseHandler. Removed for parsimony.

noajshu added inline comments.Mar 15 2022, 1:40 PM
llvm/lib/Debuginfod/HTTPClient.cpp
9–10
10–11
noajshu added a comment.EditedMar 15 2022, 2:01 PM

Thanks for implementing this! I think it mostly looks good and should have the side benefit of saving memory for large downloads.

I hesitate to endorse the new HTTPClient::responseCode() method. It seems more intuitive to me that the HTTP response code should be stored inside the request handler object (this was the previous design). This is because there is one response code affiliated with each request. Indeed, an application can use the same HTTPClient from multiple threads; this new design prevents us from learning the individual response codes for concurrent requests.

If you agree, I think you can just un-delete the handleResponseCode declaration + definition and its invocation inside perform().

mysterymath marked an inline comment as done.Mar 15 2022, 2:28 PM

Indeed, an application can use the same HTTPClient from multiple threads; this new design prevents us from learning the individual response codes for concurrent requests.

If I'm interpreting the libcurl docs correctly, it doesn't seem like it's legal to issue multiple curl_easy_performs on the same Curl handle: https://curl.se/libcurl/c/curl_easy_perform.html
HTTPClient should still be usable in a multi-threaded fashion, but you'd need one HTTPClient per thread.

If you agree, I think you can just un-delete the handleResponseCode declaration + definition and its invocation inside perform().

I initially tried to do this, but as previously implemented, handleResponseCode only gets called after the full data transfer is complete, which is too late to decide whether or not to create a cache file.
We could parse the response code itself in handleHeaderLine, but that seemed a bit heavyweight, and I wasn't sure exactly how it'd interact with CURLOPT_FOLLOWLOCATION.

Indeed, an application can use the same HTTPClient from multiple threads; this new design prevents us from learning the individual response codes for concurrent requests.

If I'm interpreting the libcurl docs correctly, it doesn't seem like it's legal to issue multiple curl_easy_performs on the same Curl handle: https://curl.se/libcurl/c/curl_easy_perform.html
HTTPClient should still be usable in a multi-threaded fashion, but you'd need one HTTPClient per thread.

Ah, thank you for correcting me. I agree there should be one HTTPClient and Curl handle per thread.

If you agree, I think you can just un-delete the handleResponseCode declaration + definition and its invocation inside perform().

I initially tried to do this, but as previously implemented, handleResponseCode only gets called after the full data transfer is complete, which is too late to decide whether or not to create a cache file.
We could parse the response code itself in handleHeaderLine, but that seemed a bit heavyweight, and I wasn't sure exactly how it'd interact with CURLOPT_FOLLOWLOCATION.

If you like, you could set a status code member of the Handler when you check to determine whether to create a cache file. In this case you could even have the HTTPClient::statusCode() be a private method and make the Handler a friend.
On the other hand, since the multithreaded access is not an issue, perhaps it's not worth the trouble to make the HTTPClient interface stateless.

On the other hand, since the multithreaded access is not an issue, perhaps it's not worth the trouble to make the HTTPClient interface stateless.

I'm inclined to think this way; it seems difficult to a-priori determine the best interface to balance future uses and future backing implementations, given how minimal present footprint is.
Keeping it as simple as possible (without leaking boilerplate to the caller) should help make it easier to change when we need to.

This revision is now accepted and ready to land.Mar 16 2022, 12:19 PM
This revision was landed with ongoing or failed builds.Mar 21 2022, 10:27 AM
This revision was automatically updated to reflect the committed changes.
llvm/lib/Debuginfod/HTTPClient.cpp