This is an archive of the discontinued LLVM Phabricator instance.

Add ReadCStringFromMemory for faster string reads
ClosedPublic

Authored by aadsm on May 27 2019, 3:25 PM.

Details

Summary

This is the fifth patch to improve module loading in a series that started here (where I explain the motivation and solution): D62499

Reading strings with ReadMemory is really slow when reading the path of the shared library. This is because we don't know the length of the path so use PATH_MAX (4096) and these strings are actually super close to the boundary of an unreadable page. So even though we use process_vm_readv it will usually fail because the read size spans to the unreadable page and we then default to read the string word by word with ptrace.

This new function is very similar to another ReadCStringFromMemory that already exists in lldb that makes sure it never reads cross page boundaries and checks if we already read the entire string by finding '\0'.

I was able to reduce the GetLoadedSharedLibraries call from 30ms to 4ms (or something of that order).

Event Timeline

aadsm created this revision.May 27 2019, 3:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 27 2019, 3:25 PM
aadsm edited the summary of this revision. (Show Details)
labath added a comment.EditedMay 28 2019, 2:10 AM

It sounds like these improved reads would be beneficial for all kinds of reads, and not just for reading c strings. Could this chunking logic be implemented directly in ReadMemory?

That would also allow you to test this by sending m packets which are deliberately chosen to cross readable/unreadable page boundaries in interesting ways...

It sounds like these improved reads would be beneficial for all kinds of reads, and not just for reading c strings. Could this chunking logic be implemented directly in ReadMemory?

We should probably do this in ReadMemory directly to ensure we have the fastest memory reads possible for all situations. Then ignore my inline comment.

That would also allow you to test this by sending m packets which are deliberately chosen to cross readable/unreadable page boundaries in interesting ways...

BTW: LLDB will usually always send down aligned reads due to it populating its memory cache.

lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
1492–1494

Seems like this function could be added to the NativeProcess base class? We would need to add "virtual size_t NativeProcess::GetCacheLineSize() = 0;" to NativeProcess too in that case.

2078–2079

do we really need to read PATH_MAX each time?Seems like ReadCStringFromMemory could read 32 bytes at a time and check for NULL characters along the way and making sure we don't cross page boundaries?

aadsm added a comment.May 28 2019, 7:30 AM

Regarding changing the ReadMemory, yes, I was going to submit another patch with it. I was going to do it in a different way though, where I would read as much as possible with the process_vm_readv (fwiw, this function does not fail when it tries to read cross page, it will just partial read) and then read the rest with ptrace. But I guess it does makes more sense to adopt the same page boundary read strategy.

Regarding changing the ReadMemory, yes, I was going to submit another patch with it. I was going to do it in a different way though, where I would read as much as possible with the process_vm_readv (fwiw, this function does not fail when it tries to read cross page, it will just partial read) and then read the rest with ptrace.

I think this is fine too, probably even better. I was going to suggest the same thing, but then I convinced myself that this wouldn't work in case the read *starts* in the unreadable memory region.

However, now that I think about it, that is nonsense, because there is no way for us to say to the user that "we failed to read some initial bytes, but this is the memory contents after that". So just using process_vm_readv, and finishing up with ptrace sounds fine to me.

aadsm added a comment.May 28 2019, 8:07 PM

However, now that I think about it, that is nonsense, because there is no way for us to say to the user that "we failed to read some initial bytes, but this is the memory contents after that". So just using process_vm_readv, and finishing up with ptrace sounds fine to me.

The reason why I thought reading page by page still made sense in the ReadMemory is to cover for the situation where we cross 3 pages and the page in the middle is not readable. However, that might not be realistic to happen?

However, now that I think about it, that is nonsense, because there is no way for us to say to the user that "we failed to read some initial bytes, but this is the memory contents after that". So just using process_vm_readv, and finishing up with ptrace sounds fine to me.

The reason why I thought reading page by page still made sense in the ReadMemory is to cover for the situation where we cross 3 pages and the page in the middle is not readable. However, that might not be realistic to happen?

Yeah, you're right. This can even happen with just two pages if the first page is not readable-by-process_vm_readv but it *is* readable-by-ptrace. I think that's pretty unlikely to happen, but it is possible in theory. I'll leave it up to you to figure out whether it's worth implementing that.

That would also allow you to test this by sending m packets which are deliberately chosen to cross readable/unreadable page boundaries in interesting ways...

BTW: LLDB will usually always send down aligned reads due to it populating its memory cache.

Yep, this kind of test would have to be done at the lldb-server level. Which is reasonable, as we're actually testing lldb-server behavior.

lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
1492–1494

There's llvm::sys::Process::getPageSizeEstimate(), which can be used directly here. Since lldb-server is running on the same host as the debugged process, we don't even have to pipe things through the NativeProcess class.

aadsm updated this revision to Diff 202329.May 30 2019, 3:54 PM

Moved ReadCStringFromMemory to NativeProcessProtocol and address some other comments.

aadsm updated this revision to Diff 203084.Jun 4 2019, 11:06 PM

Make sure we always return a null terminated string

aadsm updated this revision to Diff 203658.Jun 7 2019, 6:09 PM

Fix a little bug and add tests

Given that you're improving the linux implementation (which is the only one that benefits from chunked reads) of ReadMemory in https://reviews.llvm.org/D62715, does it make sense to do the chunking here? After all, if an implementation (like NetBSD) has an efficient ptrace interface for reading memory, then the chunked reads might actually be pessimizing it. Theoretically, the linux implementation could be improved further to use the chunking ("If I am using ptrace, and have just crossed a page boundary, try process_vm_readv again in case the new page happens to be readable"), but I'm not sure if that is really necessary.

lldb/source/Host/common/NativeProcessProtocol.cpp
686

s/auto/void */

687

s/NULL/nullptr/

696–697

curr_addr+=bytes_read

lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
2079

It doesn't look like the reinterpret_cast should be needed here.

lldb/unittests/Host/NativeProcessProtocolTest.cpp
129–134

I'm wondering how will the caller know that the read has been truncated. Given that ReadMemory already returns an error in case of partial reads, maybe we could do the same here ? (return an error, but still set bytes_read and the string as they are now). What do you think ?

aadsm marked an inline comment as done.Jun 12 2019, 10:00 PM

Yeah, that's a good question and I did think about it at the time I wrote D62715. Here's my rationale:
Regarding the chunk reading in the ReadMemory I was not sure because I don't know how common that is and it is faster to read the entire size without chunking (the code is also less complex this way). I'd imagine that in the common case we usually know what we're reading? (e.g.: reading the entire function assembly code)
That's why I thought about the string "extraction" as a special case for memory reading. However, even if we had a ReadMemory that did chunking I'd still want to do chunking in the ReadCStringFromMemory. Otherwise we'd be reading much more than needed for a string because there is no way for the ReadMemory to know it could stop (after finding a '\0') so it might end up reading say 2KB word by word with ptrace. (I did think for 5s to pass an optional stop character to ReadMemory but that would be shoehorning an edge case into a generic and simple function, it would be better to put the chunk reading logic into a separate thing that both ReadMemory and ReadCStringFromMemory could reuse).

Valid concern on other platforms that may have better ptrace implementations. @clayborg you suggested to move it from Linux to the Protocol class, any thoughts on this?

lldb/unittests/Host/NativeProcessProtocolTest.cpp
129–134

I'm a bit hesitant on this one, because it's different semantics.
ReadMemory returns an error when it was not able to read everything that it was told to.

In this test case we were able to read everything but didn't find a null terminator. I decided to set a null terminator in this case and say that we read 1 less byte in order to 1) actually have a C string (I don't want people to strlen it and fail) and 2) allow people to read in chunks if they want (however, I don't know how useful this will be if at all) or allow them to partially read a string or even to realize that the string is larger than originally expected.

There is a way to know if it didn't read a full string with strlen(string) == bytes_read but that'd be O(n). You can also do bytes_read == sizeof-1 && string[size-2] != '\0' but it kind of sucks.
I have no good solution here :D unless we want to return an enum with the 3 different options.

Yeah, that's a good question and I did think about it at the time I wrote D62715. Here's my rationale:
Regarding the chunk reading in the ReadMemory I was not sure because I don't know how common that is and it is faster to read the entire size without chunking (the code is also less complex this way). I'd imagine that in the common case we usually know what we're reading? (e.g.: reading the entire function assembly code)
That's why I thought about the string "extraction" as a special case for memory reading. However, even if we had a ReadMemory that did chunking I'd still want to do chunking in the ReadCStringFromMemory. Otherwise we'd be reading much more than needed for a string because there is no way for the ReadMemory to know it could stop (after finding a '\0') so it might end up reading say 2KB word by word with ptrace. (I did think for 5s to pass an optional stop character to ReadMemory but that would be shoehorning an edge case into a generic and simple function, it would be better to put the chunk reading logic into a separate thing that both ReadMemory and ReadCStringFromMemory could reuse).

Ok, I think I understand what you mean. It does make sense to do chunked reads here, as most strings will be relatively short in most cases, so it does not make sense to read full 4k bytes only to find the null terminator on byte 2. However, in this case one would expect to have some fixed chunk size (128, or whatever would be the "common" string length), and read the data like that. So seeing page size being used as input here is surprising (and it means you can still end up reading 4k bytes slowly, if you happen to start reading near the beginning of a page). However, I also get why you're implementing it this way -- you're betting on the fact that most read will end up in the "fast" memory, and just want to avoid wandering into the "slow" part needlessly. That sounds like a reasonable thing to assume. So overall, I think you've convinced me about this, and we can keep this implementation, until evidence saying otherwise shows up.

lldb/unittests/Host/NativeProcessProtocolTest.cpp
101

It would be nice to have one more test where reads cross the page boundary, to make sure chunks are concatenated properly.

129–134

In this test case we were able to read everything but ...

Is that really true? We were told to read a c string. We did not find a c string, so instead we chose to mangle the data that we have read into a c string. That seems a bit error-ish.

I also find it confusing that in this (truncated) case the returned value corresponds with the length of the string, while in the "normal" case in includes the nul terminator. I think it would be more natural to do it the other way around, as that would be similar to what snprintf does.

How about we avoid this issue entirely, and have the function return Expected<StringRef> (the StringRef being backed by the buffer you pass in as the argument) instead? This way you don't have to do any null termination, as the StringRef carries the length implicitly. And one can tell that the read may be incomplete by comparing the stringref size with the input buffer size.

aadsm marked an inline comment as done.Jun 13 2019, 11:30 AM

I see what you mean, this is a good point. Yes, like you say I'm assuming the string I want to read is in a page that I can read using the fast method. And in this situation I prefer to minimize the number of calls to ReadMemory than the amount of data I'll read (given that I'll be reading from within the same page that is already cached, and also assuming the destination is also cached (probably true if in the stack)). I also wanted to side step the question of what is a good string average size because I don't have good data to back it up.
So yeah, optimizing for the case I know it exists, and we can tweak this better once we have other cases to analyse.

lldb/unittests/Host/NativeProcessProtocolTest.cpp
129–134

Is that really true? We were told to read a c string. We did not find a c string, so instead we chose to mangle the data that we have read into a c string. That seems a bit error-ish.

That is fair.

How about we avoid this issue entirely, and have the function return Expected<StringRef> (the StringRef being backed by the buffer you pass in as the argument) instead? This way you don't have to do any null termination, as the StringRef carries the length implicitly. And one can tell that the read may be incomplete by comparing the stringref size with the input buffer size.

I think my only concern with this is that people might think the passed in buffer would be null terminated after the call (this is true for snprintf). I'm not sure how to set the expectation that it wouldn't...

labath added inline comments.Jun 13 2019, 12:01 PM
lldb/unittests/Host/NativeProcessProtocolTest.cpp
129–134

That's true, but on the other hand, snprintf does not use StringRefs, only c strings. If the users only work with the returned StringRef (like most of our code should), then they shouldn't care about whether it's nul terminated or not.

I'd also be fine with nul-terminating the buffer, but still returning a StringRef to the part which does not include the nul terminator. That would make this function similar to how llvm::Twine::toNullTerminatedStringRef works (though the fact that they explicitly put "NullTerminated" in the name shows that one normally does not expect StringRefs to be null terminated).

This way, we'd lose the ability to tell whether the read was truncated or not, but that's probably not that important. We'd still have the ability to work with the string we've read in an llvm-like manner and avoid any additional length computations..

aadsm updated this revision to Diff 204977.Jun 16 2019, 5:02 PM
  • Make ReadCStringFromMemory return llvm::Expected<llvm::StringRef>>
  • Add documentation
  • Address other comments
  • Created static const size_t g_cache_line_size to avoid calling llvm::sys::Process::getPageSizeEstimate() all the time, that thing is expensive.
aadsm updated this revision to Diff 204979.Jun 16 2019, 5:07 PM
  • Address missing comments
labath accepted this revision.Jun 17 2019, 12:07 AM

LGTM, modulo the inline comments.

lldb/source/Host/common/NativeProcessProtocol.cpp
25–26

Make this a local static in ReadCStringFromMemory. There's no need to execute this immediately on program startup..

lldb/unittests/Host/NativeProcessProtocolTest.cpp
101

How about this? I still think it would be useful to have it, as the cross-page handling is the main source of complexity in this function.

This revision is now accepted and ready to land.Jun 17 2019, 12:07 AM
aadsm updated this revision to Diff 205211.EditedJun 17 2019, 4:10 PM
  • Add a test to cover cross page reads
  • Make cache_line_size a static const variable inside the function.
  • Fix how we're passing name_buffer to ReadCStringFromMemory.
labath accepted this revision.Jun 17 2019, 11:13 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJun 18 2019, 4:25 PM

There is a regression on Fedora 30 x86_64:

Failing Tests (25):
    LLDB :: tools/lldb-mi/breakpoint/break-insert-enable-pending.test
    lldb-Suite :: expression_command/call-function/TestCallStdStringFunction.py
    lldb-Suite :: expression_command/formatters/TestFormatters.py
    lldb-Suite :: expression_command/radar_9531204/TestPrintfAfterUp.py
    lldb-Suite :: expression_command/test/TestExprs.py
    lldb-Suite :: functionalities/breakpoint/cpp_exception/TestCPPExceptionBreakpoint.py
    lldb-Suite :: functionalities/breakpoint/global_constructor/TestBreakpointInGlobalConstructor.py
    lldb-Suite :: functionalities/breakpoint/move_nearest/TestMoveNearest.py
    lldb-Suite :: functionalities/inferior-assert/TestInferiorAssert.py
    lldb-Suite :: functionalities/load_unload/TestLoadUnload.py
    lldb-Suite :: functionalities/load_using_paths/TestLoadUsingPaths.py
    lldb-Suite :: functionalities/thread/num_threads/TestNumThreads.py
    lldb-Suite :: functionalities/unwind/noreturn/TestNoreturnUnwind.py
    lldb-Suite :: lang/c/conflicting-symbol/TestConflictingSymbol.py
    lldb-Suite :: lang/c/function_types/TestFunctionTypes.py
    lldb-Suite :: lang/c/global_variables/TestGlobalVariables.py
    lldb-Suite :: lang/c/shared_lib/TestSharedLib.py
    lldb-Suite :: lang/c/shared_lib_stripped_symbols/TestSharedLibStrippedSymbols.py
    lldb-Suite :: lang/cpp/auto/TestCPPAuto.py
    lldb-Suite :: lang/cpp/exceptions/TestCPPExceptionBreakpoints.py
    lldb-Suite :: lang/cpp/namespace_definitions/TestNamespaceDefinitions.py
    lldb-Suite :: python_api/thread/TestThreadAPI.py
    lldb-Suite :: tools/lldb-mi/breakpoint/TestMiBreak.py
    lldb-Suite :: tools/lldb-server/libraries-svr4/TestGdbRemoteLibrariesSvr4Support.py
    lldb-Suite :: tools/lldb-vscode/breakpoint/TestVSCode_setExceptionBreakpoints.py

This line may is maybe a different issue:

lldb-Suite :: functionalities/unwind/noreturn/TestNoreturnUnwind.py
lldb-Suite :: tools/lldb-server/libraries-svr4/TestGdbRemoteLibrariesSvr4Support.py

@aadsm Is there any update on the regression fix? This patch also breaks the C++ module test suite, so i would prefer if we could revert it if the fix takes longer.

aadsm added a comment.Jun 25 2019, 2:43 PM

I can't repro it right now so I prefer to revert it (I'm curious how this new function could influence those tests though). I actually thought this was sorted, sorry about that.

It seems to break looking up symbols from shared libraries. A simple reproducer is debugging a simple "Hello World" program in C on Linux (Arch Linux in my case, but it seems to also affect other distributions)

#include <stdio.h>

int main() {
  printf("Hello World\n");
}

and then trying to eval printf:

(lldb) target create "printftest"
Current executable set to 'printftest' (x86_64).
(lldb) b main
Breakpoint 1: where = printftest`main + 8 at main.cpp:4:3, address = 0x0000000000001108
(lldb) r
Process 56813 launched: '/home/teemperor/llvm/test/printftest' (x86_64)
Process 56813 stopped
* thread #1, name = 'printftest', stop reason = breakpoint 1.1
    frame #0: 0x0000555555555108 printftest`main at main.cpp:4:3
   1    #include <stdio.h>
   2   
   3    int main() {
-> 4      printf("Hello World\n");
   5    }
(lldb) p printf("a")
error: Couldn't lookup symbols:
  printf

I was debugging it yesterday but I have no real result yet. There is/was a problem that initially it is run with:

intern-state     ProcessGDBRemote::GetLoadedModuleList
intern-state     parsing: <library-list-svr4 version="1.0"><library name="/lib64/ld-linux-x86-64.so.2" lm="0x7ffff7ffd990" l_addr="0x7ffff7fd2000" l_ld="0x7ffff7ffcdf0" /><library name="linux-vdso.so.1" lm="0x7ffff7ffe6f0" l_addr="0x7ffff7fd1000" l_ld="0x7ffff7fd1348" /></library-list-svr4>
intern-state     found (link_map:0x7ffff7ffd990, base:0x7ffff7fd2000[offset], ld:0x7ffff7ffcdf0, name:'/lib64/ld-linux-x86-64.so.2')
intern-state     found (link_map:0x7ffff7ffe6f0, base:0x7ffff7fd1000[offset], ld:0x7ffff7fd1348, name:'linux-vdso.so.1')
intern-state     found 2 modules in total

but later the module list is not refreshed with more libraries, DynamicLoaderPOSIXDYLD::RendezvousBreakpointHit just does not happen.
PASS vs. FAIL
But there may be also some other patch involved as one regression was fixed by rL363770 but that was not everything.

labath added a comment.EditedJul 1 2019, 6:08 AM

I've also reverted the preceeding patch in this series as reverting this one has caused one of the other tests to start failing (I've tried to make a more surgical fix first, but @jankratochvil pointed out that this basically reintroduced the problem that we were trying to solve by reverting this).

If I understand correctly the problem correctly, the issue is that the svr4 support on the client side is broken right now, and enabling svr4 on the server exposes that problem. If so, then it seems to me that the best course of action here is:

  • introduce a setting which controls whether the client uses svr4 (plugin.process.gdb-remote.use-svr4 ?). Have it default to false
  • reapply the preceeding patch with all fixes (i.e., revert rL364751)
  • reapply this patch
  • fix svr4 clientside (either D63868 or D62504). This patch can also flip the new setting to true, or it can be done as a followup

Originally, I was going to suggest hard-disabling the client support as step 1, but now I think a setting is a better option because:

  • it will enable people to work around svr4 issues if they happen to encounter them in the future
  • (my favourite) it will enable to test the non-svr4 code path without needing to rebuild with xml support disabled
  • it will enable people to *enable* svr4 support if it happened to work for them (in conjunction with a custom server, I guess) while steps 2--4 are underway

WDYT?

aadsm added a comment.Jul 1 2019, 7:39 AM

Yap, that makes a lot of sense to me. I looked into this over the weekend and figured out the reason why it was broken (tl;dr: we can't use LoadModules) and created a tidier version of D63868 that, like D62504 avoids fetching the svr4 packet twice.

Will put it up in a few minutes. Yesterday the git server (or phabricator's mysql) was down or something, it seemed like there was not enough disk space.

@jankratochvil I'm planning to follow Pavel's plan now that the new setting has landed and finally got some time for this.

It will be:

  • Revert rL364751 (which is a revert itself)
  • Revert 9c10b620c061 (will re-apply this one)
  • Land D64013 once it's approved (which I just realized it hasn't yet, will ping it)

Does it sound good to you?

Does it sound good to you?

Yes, that sounds great, thanks. Sorry I am not participating these days.

aadsm added a comment.Jul 23 2019, 1:41 PM

I've just done these 2 steps:

  • Revert rL364751 (which is a revert itself)
  • Revert 9c10b620c061 (will re-apply this one)

Should be fine since now we have an option that is false by default. Will then land D64013.