This is an archive of the discontinued LLVM Phabricator instance.

Skip symlinks to the original file when searching for debug info
ClosedPublic

Authored by labath on Feb 23 2015, 10:54 AM.

Download Raw Diff

Details

Reviewers

emaste
zturner

Commits

rG7257d2802527: Skip symlinks to the original file when searching for debug info
rLLDB230476: Skip symlinks to the original file when searching for debug info
rL230476: Skip symlinks to the original file when searching for debug info

Summary

Symbols::LocateExecutableSymbolFile tries to locate the file in containing the debug info in a
splitdebug configuration. It tries to skip over the original file in its search path, but it was
easily fooled by symlinks. This changes the function to use llvm::sys::fs::equivalent, which can
correctly compare symlinks.

Diff Detail

Repository: rL LLVM

Event Timeline

labath updated this revision to Diff 20523.Feb 23 2015, 10:54 AM

labath retitled this revision from to Skip symlinks to the original file when searching for debug info.

labath updated this object.

labath edited the test plan for this revision. (Show Details)

labath added reviewers: emaste, zturner.

labath added a subscriber: Unknown Object (MLST).

LGTM

Just a drive by comment: Would be good if you can explain in the commit message as to why the two changes in this patch are related.

Shouldn't this be a method on FileSpec. It would be fine to back it with the llvm function, but it's weird to have to use a mixture of FileSpec & lower-level llvm file system calls.

Jim

zturner added inline comments.Feb 23 2015, 12:23 PM

test/functionalities/unwind/noreturn/TestNoreturnUnwind.py
47–49 ↗	(On Diff #20523)	:( Really dislike seeing assumptions about mangling schemes hardcoded into string comparisons. Although admittedly, I don't think this is worse than before, and even worse, there's no good alternative. Mostly just a drive by comment. That said, this will detect any function that ends with abort. We don't have a perfect solution at this point, but checking against an explicit list of known manglings would generate fewer false positives. For example, the algorithm here would catch a function called "completely_unrelated_abort". The comment says some C libraries mangle into __GI_abort "or similar". Is the list of similar manglings long?

In D7836#128456, @jingham wrote:

Shouldn't this be a method on FileSpec. It would be fine to back it with the llvm function, but it's weird to have to use a mixture of FileSpec & lower-level llvm file system calls.

Jim

I'm actually a fan of the lower level file-system api calls because it makes it clear to the reader that host-specific code is going to execute. I actually wish FileSpec didn't even expose any host-specific operations, but it's hard to revisit that design now that it's in the public api. There's Host::FileSystem which could be a good place for it. But I kind of don't like the idea of FileSpec exposing a method that actually hits the host file system. It's kind of a fine line, I agree, but I just envision FileSpec as something that manipulates paths. This opens up the possibility of using FileSpecs to represent paths on remote hosts, for example.

That is not how we thought of FileSpec originally. It clearly was meant to deal with real files (for instance it has Exists methods & Read, etc.) It was meant to be a wrapper that would allow us to do all the useful file manipulation in a system-independent way. It ended up with system dependencies just because we didn't have the time to keep it clean when there were no ports of lldb that weren't Unix... But that's life, not design.

Anyway, it would be weird to have some object you use to denote the files you are interested in, but then when you actually want to do something with them, you have to turn around and use another object. And it would be unfortunate to have to write system specific code at this point except for the - hopefully very few - cases where the sort of thing we want to do to a file is not able to be expressed in a system-independent way.

Jim

In D7836#128511, @jingham wrote:

That is not how we thought of FileSpec originally. It clearly was meant to deal with real files (for instance it has Exists methods & Read, etc.) It was meant to be a wrapper that would allow us to do all the useful file manipulation in a system-independent way. It ended up with system dependencies just because we didn't have the time to keep it clean when there were no ports of lldb that weren't Unix... But that's life, not design.

Anyway, it would be weird to have some object you use to denote the files you are interested in, but then when you actually want to do something with them, you have to turn around and use another object. And it would be unfortunate to have to write system specific code at this point except for the - hopefully very few - cases where the sort of thing we want to do to a file is not able to be expressed in a system-independent way.

Jim

Well, already we have introduced the notion of the path syntax so that FileSpec can in theory refer to a path that is invalid for the host you're on. In that case it doesn't make much sense to have a Windows path, for example on Mac, and then try to do a file system operation on it. When I refactored some of Host months ago, there were other methods directly in Host that would do things like get file permissions, etc. So there was already some separation between the FileSpec and the methods that hit the file system. Most of that stuff ended up in FileSystem, and the stuff that remains in FileSpec is only there because I couldn't change the public API.

But with the notion of a FileSpec that can refer to a path that can't possibly be on your local host, I think it makes sense to consider whether the file system access routines should be separated from FileSpec.

Well I really like to keep ifdefs out of code. I know that Host is the
place for ifdefs, and FileSpec is in Host, but I still like to keep them to
a minimum. It's too easy to break things when you have to try to figure
out which ifdef paths you need to modify. That was the reason for creating
the FileSystem class. It exposes a platform agnostic interface to low
level FS operations. Similar to llvm::sys::fs, and there's probably some
duplication there that could be removed, but I just didn't want to break
too much stuff at the time.

I really wouldn't want to see a bunch of ifdefs ending up in FileSpec
(there's already some there that I'd like to see removed, as described
below). A lot of what it does is consistent across all platforms and is
just focused on splitting apart a string, and constifying the directory and
filename portions. And so I think it would be a shame to muck up this
(mostly) platform-agnostic class with some platform specific stuff. It
would be great if there were a way to isolate the disk access part. I
don't think it's too crazy to write something like

FileSpec MyFile("/usr/bin/foo");
FileSystem::DirectoryEnumerator Enumerator(MyFile);
for (FileSpec &Child : Enumerator) {

if (FileSystem::IsDirectory(Child)) {
    // Do something.
}

}

In fact I think it's even better.

It improves readability by making it clear that you're hitting the file

system, so you should be careful about what type of path you're using.

It improves modularity by isolating the code that hits the file system

from the code that doesn't.

It cleans up the FileSpec implementation somewhat. Currently

EnumerateDirectory is a big ugly function with a couple of different
preprocessor branches.

Separating the responsibilities like this leads to more flexible usage

patterns. Like having DirectoryEnumerator be an object that can be used in
a range-based for loop here, instead of forcing its usage to conform to
what you can write with a single function call.

Anyway, TL;DR of all this is that I know it was originally designed to be
all things related to files, but as it grows, it starts to become really
heavy. I think it makes sense to consider a separation of
responsibilities. Does a function that mmaps file contents really belong
in the same class as a function that resolves a symlink? I don't think we
need or even should have 1 class that is all things file system. We should
try to find ways to break it down into smaller, more targeted, more
meaningful components.

Bleh: This came across really ugly in the email Phab sent out due to me responding from email. Not sure if this will be any be better, but here it is again

Well I really like to keep ifdefs out of code. I know that Host is the place for ifdefs, and FileSpec is in Host, but I still like to keep them to a minimum. It's too easy to break things when you have to try to figure out which ifdef paths you need to modify. That was the reason for creating the FileSystem class. It exposes a platform agnostic interface to low level FS operations. Similar to llvm::sys::fs, and there's probably some duplication there that could be removed, but I just didn't want to break too much stuff at the time.

I really wouldn't want to see a bunch of ifdefs ending up in FileSpec (there's already some there that I'd like to see removed, as described below). A lot of what it does is consistent across all platforms and is just focused on splitting apart a string, and constifying the directory and filename portions. And so I think it would be a shame to muck up this (mostly) platform-agnostic class with some platform specific stuff. It would be great if there were a way to isolate the disk access part. I don't think it's too crazy to write something like

FileSpec MyFile("/usr/bin/foo");
FileSystem::DirectoryEnumerator Enumerator(MyFile);
for (FileSpec &Child : Enumerator) {
    if (FileSystem::IsDirectory(Child)) {
        // Do something.
    }
}

In fact I think it's even better.

It improves readability by making it clear that you're hitting the file system, so you should be careful about what type of path you're using.
It improves modularity by isolating the code that hits the file system from the code that doesn't.
It cleans up the FileSpec implementation somewhat. Currently EnumerateDirectory is a big ugly function with a couple of different preprocessor branches.
Separating the responsibilities like this leads to more flexible usage patterns. Like having DirectoryEnumerator be an object that can be used in a range-based for loop here, instead of forcing its usage to conform to what you can write with a single function call.

Anyway, TL;DR of all this is that I know it was originally designed to be all things related to files, but as it grows, it starts to become really heavy. I think it makes sense to consider a separation of responsibilities. Does a function that mmaps file contents really belong in the same class as a function that resolves a symlink? I don't think we need or even should have 1 class that is all things file system. We should try to find ways to break it down into smaller, more targeted, more meaningful components.

As an additional datapoint, the STL's filesystem library is designed in this way as well. Youv'e got std::path, and then operations under std::filesystem() which take a path object as one of the arguments.

Update commit message

I have added an explanation about the test to the commit message. As for where the equivalent call should go to, I very much like the idea of the separation of responsibilities, as in std::path and std::filesystem. However, this seems out of scope of this patch. I think the next best solution is to leave the bare call to equivalent in Symbols.cpp, as it is now. I don't think of it as system-dependent, as all of that should be hidden deep in llvm.

I'll wait for additional comments, and commit it in the present form if there is no pushback.

test/functionalities/unwind/noreturn/TestNoreturnUnwind.py
47–49 ↗	(On Diff #20523)	This is not actually "mangling" as the mangling of C++ names into strings. This is just libc declaring it's function as "__GI_abort" (a completely arbitrary string), and then linking the name "abort" to it using some elf symbol reference mechanism (or something, I don't know much about these things). But basically, since this is an arbitrary string and an implementation detail of the library, we would need to go through all c libraries and check how they resolve "abort". The list probably wouldn't be long, but it would never be definitive, as these things can change at any time. As I say in the commit message it would better to teach lldb to prefer public names for a symbol, but this is an unrelated issue, which this commit merely exposed. For a test, I think checking for endswith("abort") is fine as the danger of false positives is quite minimal.

Agree with all your comments here. Lgtm, but give others a chance to
comment more before committing

Closed by commit rL230476: Skip symlinks to the original file when searching for debug info (authored by labath). · Explain WhyFeb 25 2015, 2:46 AM

This revision was automatically updated to reflect the committed changes.

I think you should check for exactly "__GI_abort" on Linux and "abort" on other platforms.

That won't work. On linux this can resolve to "abort" or "__GI_abort", depending on whether you have debug symbols for libc or not. And if you use another C library it might resolve to something completely different.

There are a limited set of strings that can come out for each platform. I think that we should only accept those specific strings on those platforms. There are currently only two CRTs that we support so we should be able to get the list of supported strings by running the test twice?

That said, it's not a huge deal. Do whatever you think is appropriate.

I don't think that's a problem since the chances of a false positive in a test case of this size are really small.

That said, I think I have now dug around in symbol tables sufficiently long to be able to fix this to resolve to "abort" nicely. I'll try whipping up a patch in the near future.

Revision Contents

Path

Size

lldb/

trunk/

source/

Host/

common/

Symbols.cpp

2 lines

test/

functionalities/

unwind/

noreturn/

TestNoreturnUnwind.py

4 lines

Diff 20661

lldb/trunk/source/Host/common/Symbols.cpp

Show First 20 Lines • Show All 81 Lines • ▼ Show 20 Lines	for (size_t idx = 0; idx < num_directories; ++idx)
files.push_back (dirname + module_directory + "/" + symbol_filename);		files.push_back (dirname + module_directory + "/" + symbol_filename);

const uint32_t num_files = files.size();		const uint32_t num_files = files.size();
for (size_t idx_file = 0; idx_file < num_files; ++idx_file)		for (size_t idx_file = 0; idx_file < num_files; ++idx_file)
{		{
const std::string &filename = files[idx_file];		const std::string &filename = files[idx_file];
FileSpec file_spec (filename.c_str(), true);		FileSpec file_spec (filename.c_str(), true);

if (file_spec == module_spec.GetFileSpec())		if (llvm::sys::fs::equivalent (file_spec.GetPath(), module_spec.GetFileSpec().GetPath()))
continue;		continue;

if (file_spec.Exists())		if (file_spec.Exists())
{		{
lldb_private::ModuleSpecList specs;		lldb_private::ModuleSpecList specs;
const size_t num_specs = ObjectFile::GetModuleSpecifications (file_spec, 0, 0, specs);		const size_t num_specs = ObjectFile::GetModuleSpecifications (file_spec, 0, 0, specs);
assert (num_specs <= 1 && "Symbol Vendor supports only a single architecture");		assert (num_specs <= 1 && "Symbol Vendor supports only a single architecture");
if (num_specs == 1)		if (num_specs == 1)
▲ Show 20 Lines • Show All 65 Lines • Show Last 20 Lines

lldb/trunk/test/functionalities/unwind/noreturn/TestNoreturnUnwind.py

Show All 38 Lines	def noreturn_unwind_tests (self):
if process.GetState() != lldb.eStateStopped:		if process.GetState() != lldb.eStateStopped:
self.fail("Process should be in the 'stopped' state, "		self.fail("Process should be in the 'stopped' state, "
"instead the actual state is: '%s'" %		"instead the actual state is: '%s'" %
lldbutil.state_type_to_str(process.GetState()))		lldbutil.state_type_to_str(process.GetState()))

thread = process.GetThreadAtIndex(0)		thread = process.GetThreadAtIndex(0)
abort_frame_number = 0		abort_frame_number = 0
for f in thread.frames:		for f in thread.frames:
if f.GetFunctionName() == "abort":		# We use endswith() to look for abort() since some C libraries mangle the symbol into
		# __GI_abort or similar.
		if f.GetFunctionName().endswith("abort"):
break		break
abort_frame_number = abort_frame_number + 1		abort_frame_number = abort_frame_number + 1

if self.TraceOn():		if self.TraceOn():
print "Backtrace once we're stopped:"		print "Backtrace once we're stopped:"
for f in thread.frames:		for f in thread.frames:
print " %d %s" % (f.GetFrameID(), f.GetFunctionName())		print " %d %s" % (f.GetFrameID(), f.GetFunctionName())

Show All 33 Lines