This is an archive of the discontinued LLVM Phabricator instance.

Always normalize FileSpec paths.
ClosedPublic

Authored by clayborg on Apr 23 2018, 11:38 AM.

Details

Summary

Always normalizing lldb_private::FileSpec paths will help us get a consistent results from comparisons when setting breakpoints and when looking for source files. This also removes a lot of complexity from the comparison routines. Modified the DWARF line table parser to use the normalized compile unit directory if needed.

Diff Detail

Event Timeline

clayborg created this revision.Apr 23 2018, 11:38 AM
clayborg edited reviewers, added: aprantl; removed: pranavb.Apr 23 2018, 11:38 AM
clayborg updated this revision to Diff 143608.Apr 23 2018, 11:40 AM

Remove unwanted file changes.

aprantl added inline comments.Apr 23 2018, 11:41 AM
lldb.xcodeproj/xcshareddata/xcschemes/darwin-debug.xcscheme
29 ↗(On Diff #143605)

Is this an accidental change or is it relevant to the patch?

lldb.xcodeproj/xcshareddata/xcschemes/desktop.xcscheme
68 ↗(On Diff #143605)

This looks like an accidental commit?

clayborg updated this revision to Diff 143609.Apr 23 2018, 11:45 AM

Remove commented out code.

aprantl added inline comments.Apr 23 2018, 11:46 AM
source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
189

// slashes and other path ano*m*alies before we use it for path prepending*.*

source/Utility/FileSpec.cpp
105

Are we sure that we need this? Could we use llvm::sys::path::has_relative_path & friends instead? http://llvm.org/doxygen/namespacellvm_1_1sys_1_1path.html

107

shouldn't this be a more general path separator (we probably have something like this in Support/Path.h)

clayborg added inline comments.Apr 23 2018, 2:20 PM
source/Utility/FileSpec.cpp
105

I only want to scan the string one time. This will return true if this contains relative redundancies or extra slashes, not if the path is relative. "./foo" doesn't need normalization and this will return false, but "foo/." would need to be normalized to "foo", as well as "foo/./bar" and "foo/../bar". So this function is doing something completely different than the LLVM counterparts

107

no as the only caller to this function switches the slashes to '/' already..

clayborg updated this revision to Diff 143643.Apr 23 2018, 2:24 PM

Fix comment typo

clayborg marked an inline comment as done.Apr 23 2018, 2:25 PM
aprantl added inline comments.Apr 23 2018, 2:36 PM
source/Utility/FileSpec.cpp
107

If we change that to also replace double-/ with singles, we could replace this function with a call to llvm::sys::path::remove_dots()

http://llvm.org/doxygen/namespacellvm_1_1sys_1_1path.html#a35c103b5fb70a66a1cb5da3b56f588a1

115

Does this also turn "//WORKGROUP/Foo" into "/WORKGROUP/Foo/"?

One general question: why is this form of normalization preferred over calling realpath?

clayborg added inline comments.Apr 23 2018, 3:08 PM
source/Utility/FileSpec.cpp
107

I am fine switching to using the llvm functions for removing, this is only detecting if any normalization needs to happen. If there is an equivalent LLVM function that will only run through the string one time to detect all needs for normalization (no multiple passes looking for "..", then for "." etc like we used to have), I would be happy to use it, but there doesn't seem to be. We are trying to avoid having to create a "SmallVectorImpl< char >" for all paths if they don't need fixing here. This function is just iterating through an llvm::StringRef just to see if normalization needs to happen. If it does, then we make a SmallVectorImpl< char > and we fix the path. We can easily use llvm functions for the fixing part.

115

No it does not,. Notice the "if (i > 0)" below. We need to keep leading "//"

One general question: why is this form of normalization preferred over calling realpath?

Normalization is everything we can do to fix up a path without knowing anything about the current working directory or any symlinks. So we can remove redundant references to the current directory (".") or the parent directory (".."), but only if they are not at the start of the path. Since our path may have been created on a different machine with an unknown symlinks, realpath really can't do anything for us. What is a path like "./foo.txt" relative to when stored in the debug info with no compilation directory? what if we have "/tmp/mysymlink" mean on another machine? We can't realpath it because we don't know the actual root or any of the symlinks.

aprantl added inline comments.Apr 23 2018, 3:39 PM
source/Utility/FileSpec.cpp
107

I see, you want to avoid copying the string when it isn't necessary to do so. IMHO the best way to do this is to add a bool hasDots(const Twine &) function to llvm::sys::path and add a bool remove_dot_dot parameter to removeDots(). Since you have already written the unittests, adding the function to LLVM shouldn't be any extra work (I'll help reviewing), and we'll have 100 LoC less to maintain.

This way we can avoid having two functions that perform path normalization that superficially look like they might do the same thing, but a year from now nobody can really tell whether they behave exactly the same, and whether a bug found in one implementation should also be fixed in the other one, etc... .

clayborg added inline comments.Apr 23 2018, 3:49 PM
source/Utility/FileSpec.cpp
107

I want to know if there are redundant slashes as well. I want something like "bool llvm::sys::has(const Twine &, bool dots, bool dot_dots, bool redundant_slashes)". But I don't see that getting accepted? I just want a "bool llvm:sys::path::contains_redundant_stuff(const Twine &)". Not sure on the name though. needs_normalization? can_be_shortened? Everything in LLVM right now assumes that dots, dot dots, slashes and realpath stuff happen in the same function. Not sure how to break that up without ruining the performance of the one and only loop that should be happening. I like the current approach since it doesn't require chopping up the string into an array and iterating through the array.

clayborg added inline comments.Apr 23 2018, 3:55 PM
source/Utility/FileSpec.cpp
107

Also not sure if the LLVM stuff will try to substitute in the current working directory for "."?

aprantl added inline comments.Apr 23 2018, 4:18 PM
source/Utility/FileSpec.cpp
107

I just read the source code of remove_dots() (http://llvm.org/doxygen/Path_8cpp_source.html#l00699) and if I'm not mistaken, it actually also removes double-separators as a side-effect, since it iterates over the path components of the string as it constructs the copy. It also seems to avoid the copy operation if it isn't necessary.

Could you take another look? Perhaps I'm missing something, (or perhaps we can just turn this into a small addition to remove_dots).

Everything in LLVM right now assumes that dots, dot dots, slashes and realpath stuff happen in the same function.

I'm not sure I understand.

Also not sure if the LLVM stuff will try to substitute in the current working directory for "."?

No it won't, remove_dots() has no side effects. There's a separate make_absolute() function. That one will cause another copy. Personally, if I have a choice between a redundant string copy and maintaining more code, I'd pick the string copy unless we know that it really matters.

I'll take that back: make_absolute only copies when the path isn't already absolute.

labath added inline comments.Apr 24 2018, 3:15 AM
source/Breakpoint/BreakpointResolverFileLine.cpp
130–131

What about paths like .foo/bar.c and .../bar.c. These don't contain . or .. components. so you'll want to avoid the consuming here. (You can use *llvm::sys::path::begin(path) to get the first path component and check that).

Also, I'm not sure what is the behavior we want for paths like ../../foo.cpp. The present behavior of matching the path as it was /../foo.cpp does not seem entirely useful.

source/Utility/FileSpec.cpp
107

FWIW, this could be implemented in a much simpler way, while still making sure we run through the string only once. I'm thinking of something like:

bool first = true;
for(it = llvm::sys::path::begin(path, style), end = llvm::sys::path::end(path); it != end; ++it) {
  if (!first && (*it == "." || *it == "..")) return true;
  first = false;
}
return false;
184–187

This looks wrong, because then for a path like ./../whatever, you will end up with both . and .. in the "normalized" path.

401–406

This is an interesting edge case, but I'm not sure if we actually want to be doing this. I think we should make an effort to formally define what do we expect from the normalized path. After the lldb-dev discussion on ./foo.cpp I have formed a tentative definition in my head, and this does not seem consistent with that.

I'll try to propose one definition:

Def: Two FileSpecs are equivalent iff:

  • they resolve to the same file in a virtual filesystem, where all referenced path components exist and are directories
  • the current working directory is deep enough so that each .. component resolves to a different directory than its predecessor.
  • both of them have at least two path components or both of them have just one path component.

Def: The normalized form of a FileSpec is the equivalent FileSpec with the least number path components. In case of ties, we choose the one with the least number of .. components.

Some explanations about how I arrived at this:
a) the "all paths exist and are directories" part is there so we can consider foo/../bar.cpp and ./bar.cpp equivalent. (Because, in general, this will not be true if foo does not exists, or is a regular file, or a symlink, etc.)
b) the "cwd is deep enough" part is there to avoid collapsing excessive leading ".." components, because / and /.. are the same file
c) the "at least two path components" is there to avoid collapsing ./foo.cpp into foo.cpp as we want to preserve the information that the file contained some kind of directory specification.
d) in the equivalence definition, the "least number of components" is there to make sure we remove all components that are removable and (hopefully) arrive at a single canonical representation.
e) the "tie" case is necessary to disambiguate between foo/.. and ./. as otherwise these two would be incomparable.

Now the (c) part would imply that foo/.. should normalize to ./. and not ., which is a bit odd, but it is consistent with our stated intention of preserving directory information. If we do not want to have ./. here, then we need to come up with a different definition of what it means to be "normalized".

857

Why is this necessary? result can't be empty because we have just appended something to it (and we've checked that components[i] is non-empty).

unittests/Utility/FileSpecTest.cpp
150–153

You've removed the remove_backup_dots argument from the FileSpec::Equal, which removes two of the combinations we need to test here. However, I think the other other two combination still deserve some testing. As it stands now, this completely removes the test coverage of the Equal function.

I am trying to switch to using llvm::sys::path::remove_dots() and we will see where we end up. By switching to this it means:

"./foo.c" --> "foo.c"
"./foo/bar.c" --> "foo/bar.c"

This makes is easier to see if a relative path matches another FileSpec since we don't have to remove the "./" from the beginning of the path.

source/Breakpoint/BreakpointResolverFileLine.cpp
131

I will be removing this after I switch to using llvm::sys::path::remove_dots() instead of the Normalize() I converted.

source/Utility/FileSpec.cpp
107

This will all go away, I am just going to call llvm::sys::path::remove_dots()...

187

Yes we would need to remove this. Again, I will switch to using llvm::sys::path::remove_dots and we can deal with the fallout there,.

406

they resolve to the same file in a virtual filesystem, where all referenced path components exist and are directories

Normalizing happens after resolving and if we don't resolve a path, we have no way to know what is a directory and what isn't. We will be setting breakpoints for remote targets quite a bit in LLDB and ww can't assume or stat anything in a path. So I would say FileSpec's are equivalent if the relative paths match all components.

Now the (c) part would imply that foo/.. should normalize to ./. and not ., which is a bit odd, but it is consistent with our stated intention of preserving directory information. If we do not want to have ./. here, then we need to come up with a different definition of what it means to be "normalized".

I guess we could make the m_directory contain "." and m_filename contain nothing for the "./." case. It doesn 't make sense to have "." in both places.

857

Because if you don't check this joining {"/', "foo.txt"} will result in "//foo.txt" which is wrong,.

labath added inline comments.Apr 24 2018, 8:27 AM
source/Utility/FileSpec.cpp
406

they resolve to the same file in a virtual filesystem, where all referenced path components exist and are directories

Normalizing happens after resolving and if we don't resolve a path, we have no way to know what is a directory and what isn't. We will be setting breakpoints for remote targets quite a bit in LLDB and ww can't assume or stat anything in a path.

Yes, I am aware of that. I am not saying this is how we should actually implement the normalization algorithm. I am trying define what a "normalization" is in the first place, so that we can then judge whether a particular normalization algorithm is good or not. I think defining normalization in terms of an actual filesystem makes sense, since at the end of the day, our algorithm should somehow approximate what happens in real file systems. I am not saying the algorithm should be doing any stats, but for the verification (either in our heads or in the tests) we can use certainly use stats or actual file systems.

So I would say FileSpec's are equivalent if the relative paths match all components.

This is too vague to be useful. I have no idea how I would apply this definition to determine if e.g. "/foo/../bar.txt" and "./bar.txt" are equivalent. And you didn't say anything about how to derive the normal form for a FileSpec.

Now the (c) part would imply that foo/.. should normalize to ./. and not ., which is a bit odd, but it is consistent with our stated intention of preserving directory information. If we do not want to have ./. here, then we need to come up with a different definition of what it means to be "normalized".

I guess we could make the m_directory contain "." and m_filename contain nothing for the "./." case. It doesn 't make sense to have "." in both places.

I don't think that is very useful, as then this would be the only special case where normalization would produce a FileSpec without a filename component.

857

Yes, but didn't the original condition guard against that already?

We know that components[i] is non-empty, and we have just appended it to result two lines above. So, unless I am missing something, result.back() should be the same as components[i].back() and this additional check does not buy us anything.

clayborg added inline comments.Apr 24 2018, 8:35 AM
source/Utility/FileSpec.cpp
406

So I would say FileSpec's are equivalent if the relative paths match all components.

This is too vague to be useful. I have no idea how I would apply this definition to determine if e.g. "/foo/../bar.txt" and "./bar.txt" are equivalent. And you didn't say anything about how to derive the normal form for a FileSpec.

"/foo/../bar.txt" would be normalized to "/bar.txt" and "./bar.txt" will, after switching to llvm's remove_dots, be normalized to "bar.txt". So those could be though of as equivalent since one is only a basename and would only need to match the filename. If you have "/foo/bar/baz.txt", it could be equivalent to "bar/baz.txt" by making sure the filename's match, and if either or both path is relative, then matching as many directories as are specified.

I guess we could make the m_directory contain "." and m_filename contain nothing for the "./." case. It doesn 't make sense to have "." in both places.

I don't think that is very useful, as then this would be the only special case where normalization would produce a FileSpec without a filename component.

Ok, then leave it as is with "." in filename, not directory. We don't need it in both places IMHO

857

I did not. During testing I found this case and "/" + "main.c" was producing "//main.c"

labath added inline comments.Apr 24 2018, 9:22 AM
source/Utility/FileSpec.cpp
406

Ok, if we start using llvm's remove_dots as our normalization algorithm, then both of these issues will become moot (and I believe I have a fairly good understanding of how remove_dots works).

857

I've just tried calling join_path_components(ePathSyntaxPosix, {"/", "main.c"}), and it produced /main.c even without your modifications.

clayborg updated this revision to Diff 143777.Apr 24 2018, 10:02 AM

Switch over to using llvm::sys::path::remove_dots(), remove the ::Normalize() function and fix a few issue discovered during testing.

aprantl added inline comments.Apr 24 2018, 12:48 PM
source/Utility/FileSpec.cpp
379

// Normalize the path by removing './' and other redundant components.

380

Thanks!

clayborg updated this revision to Diff 143801.Apr 24 2018, 1:23 PM

Added comment before llvm::sys::path_remove_dots(...)

This code itself looks fine, I have just two minor comments.

However, I do have a question about performance. I remember us being very worried about performance in the past, so we ended up putting in this like r298876. This removes the normalization step during FileSpec comparison, but it introduces mandatory normalization upon every FileSpec creation, so it's not obvious to me what will this do to performance. Should we try to benchmark this somehow?

Jim, you seem to have encountered a case when the normalization was a bottleneck (r298876). Do you remember what situation was that in?

source/Utility/FileSpec.cpp
382

It looks like this is unused.

401–402

remove_dots should never produce a path like this, so we should be able to revert this now.

This code itself looks fine, I have just two minor comments.

However, I do have a question about performance. I remember us being very worried about performance in the past, so we ended up putting in this like r298876. This removes the normalization step during FileSpec comparison, but it introduces mandatory normalization upon every FileSpec creation, so it's not obvious to me what will this do to performance. Should we try to benchmark this somehow?

Ok, so I ran a benchmark and we are about 7% slower on a completely cold file cache, and 9% slower on a warm file cache. If I add the needsNormalization() function back in, we are 7% faster for cold, and 10% faster for warm. So I will add the needsNormalization() back in.

clayborg updated this revision to Diff 144021.Apr 25 2018, 3:15 PM

After doing performance tests, the code was 7 to 10 % slower if we didn't check if a path needs normalization due to the llvm code making arrays of StringRef objects and appending a path together. Restored and even improved performance after adding back the needsNormalization() function that quickly checks if a path needs normalization and avoids the splitting of the path and reconstruction of the path if it isn't needed.

labath accepted this revision.Apr 26 2018, 8:53 AM

Looks fine to me. Normalization, at least as it is implemented now in remove_dots, is a fairly heavy operation, so it makes sense to avoid it when possible. And the extra speedup is great.

source/Utility/FileSpec.cpp
341

these can be just static functions. No need for a namespace.

unittests/Utility/FileSpecTest.cpp
235–236

Neither the present behavior, nor c:\ is correct here. c:..\.. is a path relative to the current directory on the C drive, C:\ is the root of the C drive, and c:\..\.. is equivalent to c:\. However, it seems nobody cares about corner cases like this, as there are bunch of them in the llvm path handling.

This revision is now accepted and ready to land.Apr 26 2018, 8:53 AM
This revision was automatically updated to reflect the committed changes.