This is an archive of the discontinued LLVM Phabricator instance.

[VFS] Drop path traversal assertion
ClosedPublic

Authored by bruno on Feb 10 2016, 2:43 PM.

Details

Summary

This patch removes the path traversals check/assertion related with the presence of ".." in paths.

The rationale is that if source and destination paths in the YAML file contain ".." this is enough
for the file manager to retrieve the right file, meaning that it doesn't matter how we write it
since the FileManager is capable of transforming it in real/feasible paths.

This is really common and has been working unnoticed in non-assert builds for while; example of an
entry like this in the VFS YAML file follow:

{

'type': 'directory',
'name': "/llvm-install/bin/../lib/clang/3.8.0/include",
'contents': [
  {
    'type': 'file',
    'name': "__stddef_max_align_t.h",
    'external-contents': "<whatever_path_to_cache>/vfs/llvm-install/bin/../lib/clang/3.8.0/include/__stddef_max_align_t.h"
  },
...

Add a test to cover this up.

Diff Detail

Event Timeline

bruno updated this revision to Diff 47532.Feb 10 2016, 2:43 PM
bruno retitled this revision from to [VFS] Drop path traversal assertion.
bruno updated this object.
bruno added reviewers: bogner, benlangmuir.
bruno added subscribers: cfe-commits, dexonsmith.
benlangmuir edited edge metadata.Feb 10 2016, 2:53 PM

Please clarify what you mean here:

The rationale is that if source and destination paths in the YAML file contain ".." this is enough
for the file manager to retrieve the right file, meaning that it doesn't matter how we write it
since the FileManager is capable of transforming it in real/feasible paths.

The VFS layer does not have access to the FileManager. I agree that ".." in a destination path should be fine. But ".." in a source path will only work if we have code in the redirecting file system to handle ".." explicitly, which AFAICT we don't:

RedirectingFileSystem::lookupPath(sys::path::const_iterator Start,
                                  sys::path::const_iterator End, Entry *From) {
  if (Start->equals("."))
    ++Start;

  // FIXME: handle ..
bruno added a comment.Feb 10 2016, 3:04 PM

Please clarify what you mean here:

The rationale is that if source and destination paths in the YAML file contain ".." this is enough
for the file manager to retrieve the right file, meaning that it doesn't matter how we write it
since the FileManager is capable of transforming it in real/feasible paths.

The VFS layer does not have access to the FileManager. I agree that ".." in a destination path should be fine. But ".." in a source path will only work if we have code in the redirecting file system to handle ".." explicitly, which AFAICT we don't:

RedirectingFileSystem::lookupPath(sys::path::const_iterator Start,
                                  sys::path::const_iterator End, Entry *From) {
  if (Start->equals("."))
    ++Start;

  // FIXME: handle ..

What I mean here is that if you ask the FileManager for "/llvm-install/bin/../lib/clang/3.8.0/include/__stddef_max_align_t.h", it will successfully invoke VFS code and find "<whatever_path_to_cache>/vfs/llvm-install/bin/../lib/clang/3.8.0/include/__stddef_max_align_t.h". Since ".." is split into its own path component, it will compare ".." in Start and From, succeed and the search will continue. Probably this traversal code was not initially meant to work this way, but the actual fact is that it does.

The test in the patch tests exactly that.

I don't think this is the right approach. If we don't canonicalize the source path then:

  • looking up the path *without* the .. won't work, which means anything that looks up a realpath will fail
  • directory iteration won't combine entries from foo/bar/.. and foo/

I think we're better off handling ".." in lookup and always using canonicalized paths in the YAML representation.

I don't think this is the right approach. If we don't canonicalize the source path then:

  • looking up the path *without* the .. won't work, which means anything that looks up a realpath will fail

If we canonicalize, then it needs to be done in two places:

  1. In the module dependence collector while collecting header files, which will then collect the real path.

Doing it at this point guarantees that we will only write out canonicalized paths to the YAML file.

  1. While requesting paths from the VFS overlay, we must first canonicalize

the path in

RedirectingFileSystem::lookupPath(const Twine &Path_)

before calling out

RedirectingFileSystem::lookupPath(sys::path::const_iterator Start,
                                  sys::path::const_iterator End, Entry *From)

One way to make this work is to use llvm::sys::path::remove_dots(Path, true) in (1) and (2).
That said, instead of this YAML output:

{

'type': 'directory',
'name': "/install-dir/bin/../lib/clang/3.8.0/include",
'contents': [
  ...
  {
    'type': 'file',
    'name': "altivec.h",
    'external-contents': "<path_to_cache>/install-dir/bin/../lib/clang/3.8.0/include/altivec.h"
  },
  ...

We would have:

{

'type': 'directory',
'name': "/install-dir/lib/clang/3.8.0/include",
'contents': [
  ...
  {
    'type': 'file',
    'name': "altivec.h",
    'external-contents': "<path_to_cache>/install-dir/lib/clang/3.8.0/include/altivec.h"
  },
  ...

Therefore, while using the new YAML file, whenever the FileManager requests status in (2) for the path "/install-dir/bin/../lib/clang/3.8.0/include/altivec.h", it first call llvm::sys::path::remove_dots(Path, true), which will give back "/install-dir/lib/clang/3.8.0/include/altivec.h", lookupPath is going to be able to match this entry from the YAML file and success!

However, if in the original filesystem "bin" is a symlink, in (1) we would have to use "realpath()" instead of "remove_dots()", let's say it yields "/private/tmp/a/b/lib/clang/3.8.0/include/altivec.h", the YAML will look like:

{

'type': 'directory',
'name': "/private/tmp/a/b/lib/clang/3.8.0/include",
'contents': [
  ...
  {
    'type': 'file',
    'name': "altivec.h",
    'external-contents': "<path_to_cache>/private/tmp/a/b/lib/clang/3.8.0/include/altivec.h"
  },
  ...

Now, we try to use the new YAML file + cache from a new clang invocation. When the FileManager requests status in (2) for the path "/install-dir/bin/../lib/clang/3.8.0/include/altivec.h" neither "remove_dots()" nor "realpath()" would be able to resolve it to "/private/tmp/a/b/lib/clang/3.8.0/include/altivec.h" and will fallback to search the real filesystem instead, if this is a different machine, it will fail. Since we are looking at a virtual path in (2) it could makes sense to use remove_dots() but it won't make sense to use "realpath()".

How do you propose we fix that?

Another issue is that "realpath()" is expensive, but I guess we could only use it whenever the path contains ".." components and cache the base directory for speeding up translations. Another solution is to generate two entries in the YAML file:

{

'type': 'directory',
'name': "/install-dir/bin/../lib/clang/3.8.0/include",
'contents': [
  ...
  {
    'type': 'file',
    'name': "altivec.h",
    'external-contents': "<path_to_cache>/private/tmp/a/b/lib/clang/3.8.0/include/altivec.h"
  },
'type': 'directory',
'name': "/private/tmp/a/b/lib/clang/3.8.0/include",
'contents': [
  ...
  {
    'type': 'file',
    'name': "altivec.h",
    'external-contents': "<path_to_cache>/private/tmp/a/b/lib/clang/3.8.0/include/altivec.h"
  },
  ...
  • directory iteration won't combine entries from foo/bar/.. and foo/

Yes, but if "bar" is a symbolic link we have another set of problems, like mentioned above.

I think we're better off handling ".." in lookup and always using canonicalized paths in the YAML representation.

Makes sense, we just need to come with a general solution for the symbolic link in path components.

Now, we try to use the new YAML file + cache from a new clang invocation. When the FileManager requests status in (2) for the path "/install-dir/bin/../lib/clang/3.8.0/include/altivec.h" neither "remove_dots()" nor "realpath()" would be able to resolve it to "/private/tmp/a/b/lib/clang/3.8.0/include/altivec.h" and will fallback to search the real filesystem instead, if this is a different machine, it will fail. Since we are looking at a virtual path in (2) it could makes sense to use remove_dots() but it won't make sense to use "realpath()".

How do you propose we fix that?

Ugh, I forgot about this case. Sorry for making this sound simpler than it is!

Your two-path solution seems like it's on the right track but I don't like that it brings us back to having ".." in the source path. Do you think it's feasible to store a mapping from the realpath, and additionally have a symlink entry in the VFS (which would be a new feature for the VFS in general)? When we write the YAML file, we could realpath(source) and if it is different from remove_dots(source) we would add a symlink entry. I suspect figuring out whthe symlink would add non-trivial overhead, since we'd have to look at every path component... at least it would only be paid when creating the YAML, not when reading it. Any thoughts?

Your two-path solution seems like it's on the right track but I don't like that it brings us back to having ".." in the source path.

Given the two-path solution, another thing we could do in the first path, is to default to the remove_dots() version for the source, i.e. "/install-dir/lib/clang/3.8.0/include" instead of "/install-dir/bin/../lib/clang/3.8.0/include". The path after remove_dots() might be different from what realpath() would return (but this is covered in the 2nd path) but at least we canonicalize for a future virtual path request (which can safely request the VFS based on the returned path from remove_dots()). What do you think about it?

Do you think it's feasible to store a mapping from the realpath, and additionally have a symlink entry in the VFS (which would be a new feature for the VFS in general)?
When we write the YAML file, we could realpath(source) and if it is different from remove_dots(source) we would add a symlink entry.

Correct me if I'm wrong, but do you suggest we have a new entry type in the YAML file for declaring symlinks? Right now, we handle symlinks by copying them out as real files inside the VFS. However, adding extra "name" entries to map for real paths for such symlinks inside the YAML should do it.

I suspect figuring out whthe symlink would add non-trivial overhead, since we'd have to look at every path component... at least it would only be paid when creating the YAML, not when reading it. Any thoughts?

I agree that the price should only be paid only when creating the YAML files.

Given the two-path solution, another thing we could do in the first path, is to default to the remove_dots() version for the source, i.e. "/install-dir/lib/clang/3.8.0/include" instead of "/install-dir/bin/../lib/clang/3.8.0/include". The path after remove_dots() might be different from what realpath() would return (but this is covered in the 2nd path) but at least we canonicalize for a future virtual path request (which can safely request the VFS based on the returned path from remove_dots()). What do you think about it?

I'm not sure I understand how the 2nd path would cover the realpath case. It is just an entry for the realpath itself; that doesn't help if the client looks up "/install-dir/lib/clang/3.8.0" directly (not just from removing dots from bin/..).

Correct me if I'm wrong, but do you suggest we have a new entry type in the YAML file for declaring symlinks? Right now, we handle symlinks by copying them out as real files inside the VFS. However, adding extra "name" entries to map for real paths for such symlinks inside the YAML should do it.

Yes, that's what I'm suggesting.

I'm not sure I understand how the 2nd path would cover the realpath case. It is just an entry for the realpath itself; that doesn't help if the client looks up "/install-dir/lib/clang/3.8.0" directly (not just from removing dots from bin/..).

What I was suggestting is this:

'type': 'directory',
'name': "/install-dir/lib/clang/3.8.0/include",
'contents': [

...
{
  'type': 'file',
  'name': "altivec.h",
  'external-contents': "<path_to_cache>/private/tmp/a/b/lib/clang/3.8.0/include/altivec.h"
},

'type': 'directory',
'name': "/private/tmp/a/b/lib/clang/3.8.0/include",
'contents': [

...
{
  'type': 'file',
  'name': "altivec.h",
  'external-contents': "<path_to_cache>/private/tmp/a/b/lib/clang/3.8.0/include/altivec.h"
},
...

The second case is here to address what you mentioned in the first comment "looking up the path *without* the .. won't work, which means anything that looks up a realpath will fail". But thinking more about it, I don't see how it's useful to have this second entry since the VFS will only be requested for "/install-dir/bin/../lib/clang/3.8.0/include" anyway. What about we stick to 1st path? This should solve the ".." issue and at the same time synchronise with the virtual path lookup if we use remove_dots() over there too. Am I missing something here?

Correct me if I'm wrong, but do you suggest we have a new entry type in the YAML file for declaring symlinks? Right now, we handle symlinks by copying them out as real files inside the VFS. However, adding extra "name" entries to map for real paths for such symlinks inside the YAML should do it.

Yes, that's what I'm suggesting.

Is there any real reason for this additional logic? Why an regular additional entry in the YAML file isn't enough?
BTW, there are two cases for symlinks:

(A) The above where the symlink is a path component. Unless I missed something, this could be solved with only one path entry, like discussed above.
(B) The symlink is the final header itself, this could be solved without any extra logic by duplicating the entries while maintaining only one copy in the .cache/vfs dir, example:

Take the symlink:
/usr/include/pthread.h -> pthread/pthread.h

'type': 'directory',
'name': "/usr/include/",
'contents': [

...
{
  'type': 'file',
  'name': "pthread.h",
  'external-contents': "<path_to_cache>/usr/include/pthread/pthread.h"
},

...
'type': 'directory',
'name': "/usr/include/pthread",
'contents': [

...
{
  'type': 'file',
  'name': "pthread.h",
  'external-contents': "<path_to_cache>/usr/include/pthread/pthread.h"
},
...

Okay, let's go with the two-path solution. Thanks!

bruno updated this revision to Diff 48397.Feb 18 2016, 1:36 PM
bruno edited edge metadata.

Updated the patch, with the following changes:

Handle ".", ".." and "./" with trailing slashes while collecting files
to be dumped into the vfs overlay directory.

Include the support for symlinks into components. Given the path:

/install-dir/bin/../lib/clang/3.8.0/include/altivec.h, if "bin"
component is a symlink, it's not safe to use `path::remove_dots` here,
and `realpath` is used to get the right answer. Since `realpath`
is expensive, we only do it at collecting time (which only happens
during the crash reproducer) and cache the base directory for fast lookups.

Overall, this patch makes the input to the VFS YAML file to be canonicalized
to never contain traversal components.
benlangmuir accepted this revision.Feb 19 2016, 6:00 PM
benlangmuir edited edge metadata.

LGTM if you fix 256->PATH_MAX.

lib/Frontend/ModuleDependencyCollector.cpp
66

Please use PATH_MAX.

126–128

How about "HasRemovedSymlinkComponent" to make it clear this is only for symlinks followed by ".."?

test/Modules/crash-vfs-path-symlink-component.m
16

Maybe add indentation after the RUN: to make it easier to see that the command is wrapped?

This revision is now accepted and ready to land.Feb 19 2016, 6:00 PM
bruno closed this revision.Feb 22 2016, 11:11 AM

Committed r261551

Excuse me, I have reverted it in r263636.

lib/Frontend/ModuleDependencyCollector.cpp
65

"clang/Config/config.h" doesn't have it and "llvm/Config/config.h" should be unavailable in clang tree.

I suggest you may do;

  1. Move it to LLVMSupport.
  2. Detect HAVE_REALPATH in clang's cmakefiles.
  3. Export HAVE_REALPATH into llvm-config.h.