Page MenuHomePhabricator

[lldb/ipv6] Support running lldb tests in an ipv6-only environment.
ClosedPublic

Authored by rupprecht on Sep 8 2020, 3:31 PM.

Details

Summary

When running in an ipv6-only environment where AF_INET sockets are not available, many lldb tests (mostly gdb remote tests) fail because things like 127.0.0.1 don't work there.

This change does not use localhost because that may cause slowdown due to slow dns lookups depending on the user's configuration. Instead, try 127.0.0.1 and fallback to ::1 if that doesn't work. For sockets, create an AF_INET socket first and create an AF_INET6 if that doesn't work.

Diff Detail

Unit TestsFailed

TimeTest
450 mswindows > LLVM.DebugInfo::debuglineinfo-path.ll
Script: -- : 'RUN: at line 7'; c:\ws\w16-1\llvm-project\premerge-checks\build\bin\llc.exe -mtriple=x86_64-pc-windows-gnu -O0 -filetype=obj -o C:\ws\w16-1\llvm-project\premerge-checks\build\test\DebugInfo\Output\debuglineinfo-path.ll.tmp < C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\DebugInfo\debuglineinfo-path.ll
140 mswindows > LLVM.DebugInfo::symbolize-build-id.test
Script: -- : 'RUN: at line 1'; c:\ws\w16-1\llvm-project\premerge-checks\build\bin\yaml2obj.exe C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\DebugInfo\symbolize-build-id.test -o C:\ws\w16-1\llvm-project\premerge-checks\build\test\DebugInfo\Output\symbolize-build-id.test.tmp
40 mswindows > LLVM.DebugInfo::symbolize-gnu-debuglink.test
Script: -- : 'RUN: at line 1'; c:\ws\w16-1\llvm-project\premerge-checks\build\bin\llvm-symbolizer.exe --obj=C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\DebugInfo/Inputs/dwarfdump-test.elf-x86-64.debuglink 0x40113f | c:\ws\w16-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\DebugInfo\symbolize-gnu-debuglink.test
140 mswindows > LLVM.DebugInfo::symbolize.test
Script: -- : 'RUN: at line 1'; echo "C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\DebugInfo/Inputs/dwarfdump-test.elf-x86-64 0x40113f" > C:\ws\w16-1\llvm-project\premerge-checks\build\test\DebugInfo\Output\symbolize.test.tmp.input
160 mswindows > LLVM.DebugInfo/X86::2011-09-26-GlobalVarContext.ll
Script: -- : 'RUN: at line 1'; c:\ws\w16-1\llvm-project\premerge-checks\build\bin\llc.exe -mtriple=x86_64-pc-linux-gnu C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\2011-09-26-GlobalVarContext.ll -o C:\ws\w16-1\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\2011-09-26-GlobalVarContext.ll.tmp -filetype=obj
View Full Test Results (10 Failed)

Event Timeline

rupprecht created this revision.Sep 8 2020, 3:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 8 2020, 3:31 PM
rupprecht requested review of this revision.Sep 8 2020, 3:31 PM

FYI: we switched away from "localhost" a long time ago due to issues with people having a "localhost" entry in their /etc/hosts folder.

FYI: we switched away from "localhost" a long time ago due to issues with people having a "localhost" entry in their /etc/hosts folder.

I assume you mean people *not* having a "localhost" entry? I didn't even realize that was an option. Yes, this patch would break that. I'll have to take another approach, then.

Do you happen to have any references (bugs etc.) for those kinds of issues?

FYI: we switched away from "localhost" a long time ago due to issues with people having a "localhost" entry in their /etc/hosts folder.

I assume you mean people *not* having a "localhost" entry?

I believe the issue was with people having a modified version that replaces localhost with some other IP address.

I didn't even realize that was an option. Yes, this patch would break that. I'll have to take another approach, then.

You might check the "git log" on any files that have "127.0.0.1" in them for details.

Do you happen to have any references (bugs etc.) for those kinds of issues?

FYI: we switched away from "localhost" a long time ago due to issues with people having a "localhost" entry in their /etc/hosts folder.

I assume you mean people *not* having a "localhost" entry?

I believe the issue was with people having a modified version that replaces localhost with some other IP address.

I didn't even realize that was an option. Yes, this patch would break that. I'll have to take another approach, then.

You might check the "git log" on any files that have "127.0.0.1" in them for details.

Do you happen to have any references (bugs etc.) for those kinds of issues?

Looks like rLLDB202424 is the patch that did most of this. There may be more context in rdar://problem/16154630 but I don't have access to that.

FYI: we switched away from "localhost" a long time ago due to issues with people having a "localhost" entry in their /etc/hosts folder.

I assume you mean people *not* having a "localhost" entry?

I believe the issue was with people having a modified version that replaces localhost with some other IP address.

I didn't even realize that was an option. Yes, this patch would break that. I'll have to take another approach, then.

You might check the "git log" on any files that have "127.0.0.1" in them for details.

Do you happen to have any references (bugs etc.) for those kinds of issues?

Looks like rLLDB202424 is the patch that did most of this. There may be more context in rdar://problem/16154630 but I don't have access to that.

Not a lot of context in that radar. Greg wrote:

Anyone who was able to previously reproduce this issue by having DNS lookups on "localhost" take more than 3 seconds should no longer see this issue because we are now using "127.0.0.1" which will just get parsed as 4 numbers, not as a "lookup this name in a DNS registry".

rupprecht updated this revision to Diff 290799.Sep 9 2020, 12:40 PM
  • Don't use "localhost" to avoid dns latency. Instead, prefer either 127.0.0.1 or ::1 directly.
rupprecht updated this revision to Diff 290800.Sep 9 2020, 12:41 PM
  • clang-format
rupprecht edited the summary of this revision. (Show Details)Sep 9 2020, 12:43 PM

This is ready for review now after scrubbing out the "localhost" changes I had earlier.

Thanks for the extra work arounds! LGTM. Pavel?

(Sorry about the delay.) Given the current requirements, I think this patch is fine (excellent even).

That said, I'm not sure whether the original motivation for this requirement (avoiding dns lookups) is still relevant. These days, we communicate with the local debug server via socketpair(2) sockets (which wasn't the case back then), which does not require any dns lookups and is immune to a misconfigured hosts file.

Given that none of the things touched by this patch is extremely critical (GDBRemoteCommunication::ConnectLocally is used for reproducers and the rest is test code), maybe we could use this to test the water and see whether we can start using the network stack the way it's supposed to be used ?

lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunication.cpp
1243

Nit: I believe the std::make_error_code call is redundant (direct comparison to the enum should work fine).

rupprecht updated this revision to Diff 291956.Sep 15 2020, 9:19 AM
rupprecht marked an inline comment as done.
  • Compare directly against the error code instead of hopping through std::make_error_code()

(Sorry about the delay.) Given the current requirements, I think this patch is fine (excellent even).

That said, I'm not sure whether the original motivation for this requirement (avoiding dns lookups) is still relevant. These days, we communicate with the local debug server via socketpair(2) sockets (which wasn't the case back then), which does not require any dns lookups and is immune to a misconfigured hosts file.

The initial dns lookup may still fail, I think? e.g. my initial version of this patch had:

if (llvm::Error error =
        listen_socket.Listen("localhost:0", backlog).ToError())

If the hosts file has a bad value for localhost, that will create a bad socket. That said, I don't know how to reproduce the original issue -- maybe it isn't an issue for some other reason.

Given that none of the things touched by this patch is extremely critical (GDBRemoteCommunication::ConnectLocally is used for reproducers and the rest is test code), maybe we could use this to test the water and see whether we can start using the network stack the way it's supposed to be used ?

Many non-reproducers tests fail w/o the change to GDBRemoteCommunication::ConnectLocally, so it's at least used outside of that.

I can go either way with using localhost or using hard-coded localhost IPs. Whichever way we go, I have a slight preference for keeping the tests and code consistent, e.g. if we use hard-coded IPs in code linked into lldb to allow people with a bad hosts file to use lldb, we should use hard-coded IPs in the tests to allow those same users to develop on lldb.

rupprecht updated this revision to Diff 293887.Sep 23 2020, 4:06 PM
  • Switch back to using localhost for non-socket uses

(Sorry about the delay.) Given the current requirements, I think this patch is fine (excellent even).

That said, I'm not sure whether the original motivation for this requirement (avoiding dns lookups) is still relevant. These days, we communicate with the local debug server via socketpair(2) sockets (which wasn't the case back then), which does not require any dns lookups and is immune to a misconfigured hosts file.

The initial dns lookup may still fail, I think? e.g. my initial version of this patch had:

if (llvm::Error error =
        listen_socket.Listen("localhost:0", backlog).ToError())

If the hosts file has a bad value for localhost, that will create a bad socket. That said, I don't know how to reproduce the original issue -- maybe it isn't an issue for some other reason.

Given that none of the things touched by this patch is extremely critical (GDBRemoteCommunication::ConnectLocally is used for reproducers and the rest is test code), maybe we could use this to test the water and see whether we can start using the network stack the way it's supposed to be used ?

Many non-reproducers tests fail w/o the change to GDBRemoteCommunication::ConnectLocally, so it's at least used outside of that.

I can go either way with using localhost or using hard-coded localhost IPs. Whichever way we go, I have a slight preference for keeping the tests and code consistent, e.g. if we use hard-coded IPs in code linked into lldb to allow people with a bad hosts file to use lldb, we should use hard-coded IPs in the tests to allow those same users to develop on lldb.

Switched back to the original approach of using localhost when possible [1], though I still have the git branch so I can always switch back. I don't have a personal preference either way, but would like to land one of the two approaches as this creates a strange amount of test flakiness internally when we need to support ipv4.

[1] IIUC, when connecting via hostname, attempting "localhost" will try both AF_INET and AF_INET6, but there is no corresponding way to construct a raw socket with automatic fallback to AF_INET6 if AF_INET does not work, so that part is the same as before (try to create AF_INET and catch the error manually)

I am fine with trying "localhost" and seeing if we run into any issues. Hopefully slow DNS isn't a problem anymore on Macs. Everyone else ok?

labath accepted this revision.Sep 30 2020, 7:11 AM

Sorry about the delay.

I am fine with trying "localhost" and seeing if we run into any issues. Hopefully slow DNS isn't a problem anymore on Macs. Everyone else ok?

Sounds good to me. I'm mainly hoping that with the use of socketpair(2), the DNS will not be on the critical path anymore.

Many non-reproducers tests fail w/o the change to GDBRemoteCommunication::ConnectLocally, so it's at least used outside of that.

Well... it is used in a bunch of tests exercising the socket functionality, but these are just tests. The only production usage is in reproducers. Which makes total sense when you think about it -- why would anyone be creating a socket for communicating within the same process -- there are much simpler ways to do that.

[1] IIUC, when connecting via hostname, attempting "localhost" will try both AF_INET and AF_INET6, but there is no corresponding way to construct a raw socket with automatic fallback to AF_INET6 if AF_INET does not work, so that part is the same as before (try to create AF_INET and catch the error manually)

The fully generic way to handle this would be to loop over getaddrinfo("localhost") and try to listen on each address that returns (multiple sockets). That is at least for the case when you don't know who is going to connect to you. For tests, it should be sufficient to pick a first address that works and use that.

This revision is now accepted and ready to land.Sep 30 2020, 7:11 AM
This revision was landed with ongoing or failed builds.Sep 30 2020, 11:11 AM
This revision was automatically updated to reflect the committed changes.