This is a follow up to 0be9ca7c0f9a733f846bb6bc4e8e36d46b518162 to make
paths in the case of falling back to the external file system use the
original format, preserving relative paths, and allow the external
filesystem to canonicalize them if needed.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Some of the other checks were invalid given the current tests, these seem to be enough to solve this issue without breaking existing tests
I think @vsapsai was looking at this as well.
This mostly LGTM, but I think this can be pretty easily unit tested. Please add something to the tests in llvm/unittests/Support/VirtualFileSystemTest.cpp. Also a nit inline.
llvm/lib/Support/VirtualFileSystem.cpp | ||
---|---|---|
2075–2076 | I suggest giving this a real name if it's being used for anything. Maybe PathAsWritten? |
Was looking at an issue caused by over-eager RedirectingFileSystem path canonicalization and this patch fixes it. The repro we have is more complicated (involves 3 modules), so I don't think it is worth including with this change.
llvm/lib/Support/VirtualFileSystem.cpp | ||
---|---|---|
2075–2076 | Another idea is to keep this variable Path and rename old Path to CanonicalPath. |
I'm sure I'm missing something, but after rereading the patch several times I still don't see the functional change. It just looks like it's renaming every instance of Path to CanonicalPath and Path_ to Path?
The functional change is not changing line 2016 and 2029 to use the newly named CanonicalPath when they did before. This was a bit more clear before I changed that variable name (but this is overall better IMO)
I think I might have to apply this same logic to the other functions changed in the original commit as well, this seems to solve this isolated case, but I think there are some more cases that aren't currently fixed by this
I'm realizing more now that these changes break the original intent of rG0be9ca7c0f9a733f846bb6bc4e8e36d46b518162 as well, given that it seems like the core canonicalization of the path before passing it down was the intent, but that's also the part that appears to be causing the issue. @JDevlieghere I would definitely appreciate some advice here, I'm still trying to see if I can satisfy the new test case without breaking the tests you changed, but I think they might fundamentally be at odds
Keith and I discussed this offline. My suggestion was to do the following:
- Check the overlay for the canonicalized path
- Check the fall-through for the canonicalized path
- Check the fall-through for the original path
If I understand correctly, this patch does that, but swaps 2 and 3. What's the motivation for trying the non-canonical path first? If we treat the current working directory as a property directory (which is what the canonicalization is all about), then it seems like we should exhaust those options first.
Using the canonical path first is the root of the problem, using that first the behavior to return the absolute path virtually always, for example in my new test these assertions
EXPECT_EQ("a", Name.get()); EXPECT_EQ("a", OpenedS->getName());
fail and would have to be converted to:
EXPECT_EQ("//root/foo/a", Name.get()); EXPECT_EQ("//root/foo/a", OpenedS->getName());
I'm not sure it's correct to do (3) at all...
I think the problem is that we've been conflating two separate things:
- The path used internally to find the filesystem object.
- The path reported back to the user.
I assert that the path used internally to find the filesystem object should always/only be the canonical path.
- Check the overlay for the canonicalized path.
- Check the fall-through for the canonicalized path.
But the path reported back to the user should almost always be the original path used for lookup. ONLY when the path was found in the overlay with "use-external-names" should it be modified at all.
I think this could be implemented by with the help of adding the following APIs to vfs::File:
class File { public: // Get the same file wrapped with a different reported path. static std::unique_ptr<File> getWithPath(std::unique_ptr<File> F, const Twine &P); protected: // API for mutating the path. Override to avoid needing a wrapper // object for File::getWithPath. virtual bool setPath(const Twine &Path) { return false; } };
then getWithPath can be:
namespace { class WrapperFile : public File { /* Change the observed path */ }; } std::unique_ptr<File> File::getWithPath(std::unique_ptr<File> F, const Twine &P) { if (F->setPath(P)) return F; return std::make_unique<WrapperFile>(F, P); }
Most the in-tree vfs::File derived classes can implement setPath so there shouldn't be many extra heap objects in practice.
@JDevlieghere / @keith, WDYT?
llvm/lib/Support/VirtualFileSystem.cpp | ||
---|---|---|
2075–2079 | With the new comment, I think it'd be useful to have SmallString versions of path the original and canonical paths. In which case: SmallString<256> Path; Path_.toVector(Path); SmallString<256> CanonicalPath = Path; (with the original Path_ for the parameter) seems cleanest? | |
2100–2104 | I think the correct logic here is something like: if (auto Result = ExternalFS->openFileForRead(CanonicalPath)) return Result->getPath() == CanonicalPath && Path != CanonicalPath ? File::getWithPath(*Result, Path) : *Result; else return Result.getError();
| |
2119–2121 | Can this logic be shared, with the above, maybe put in a lambda? | |
2130 | I think Path should be sent in here, not CanonicalPath. Only if use-external-names is set should the observed path be canonicalized here... and in that case, the argument will be ignored anyway. |
Yes, Keith and I came to the same conclusion yesterday. I was worried about tracking both paths at all times, but I like your suggestion of only changing the path when requested.
Another idea that could be useful would be to add a type:
struct CanonicalizedPath { StringRef OriginalPath; StringRef CanonicalPath; };
and add an overload for vfs::FileSystem::openFileForRead that takes this type.
- The default implementation in vfs::FileSystem could call openFileForRead(.Canonical) and then do a File::getWithPath(.Original) (as with my idea above, just when File::getPath matches .Canonical and .Canonical!=.Original).
- CanonicalizedPath-aware derived classes would invert the relationship. openFileForRead(FilesystemLookupPath) would use CanonicalPath for lookup, use OriginalPath for creating File, and pass through (a potentially updated!) CanonicalizedPath to base filesystems... and implement openFileForRead(Twine) by calling canonicalizePath.
Seems like a bit more code, but maybe not much, and the resulting logic might(?) be easier to reason about. (maybe a different name than openFileForRead would be better for clarity, rather than an overload?)
(Probably similar to your idea about tracking both paths at all times, but I'm not sure that needs to be mutually exclusive)
I might have missed that but what use case are we addressing by avoiding the usage of the original path for the fall-through file system? Because we have not only a problem with names reported back but also canonicalization can break symlinks with relative paths. For example, for the layout
|- packages | |- a | | `- include | `- b | `- include `- include -> packages/a/include
-I ./include/../../b/include doesn't work if you have VFS, even an empty one.
See rG0be9ca7c0f9a733f846bb6bc4e8e36d46b518162.
- RedirectingFileSystem has a virtual working directory.
- Prior to that commit, the external FS's working directory and RedirectingFileSystem's could get out of sync.
- Starting with that commit, RedirectingFileSystem consistently applies the virtual working directory.
Because we have not only a problem with names reported back but also canonicalization can break symlinks with relative paths.
Maybe this should be using makeAbsolute instead of makeCanonical (the latter calls the former)? (Note BTW that the absolute path logic is all pretty iffy on Windows, since each drive should have its own shadow virtual current directory, but that's got to be outside the scope here...)
In terms of canonicalization and symlinks, I think you're right that remove_dot_dot=true isn't really safe to use.
Ok I've updated the diff here based on @dexonsmith's original suggestion, and some offline discussion with @JDevlieghere
The new logic remaps the files and statuses, in the fallback to the external FS case, to use the originally requested path. I opted not to use the second suggestion with passing the struct containing both paths through because of the churn that would have on the API to maintain the current public signatures while also supporting that. This approach is analogous to how we're currently remapping the statuses for use-external-names as well. Please let me know what you think!
Note there's some churn from me renaming Path -> CanonicalPath and Path_ -> OriginalPath, let me know if you'd prefer I extract this into a separate diff that we land first without any logic changes.
llvm/lib/Support/VirtualFileSystem.cpp | ||
---|---|---|
1182–1183 | I'm pretty sure there was a reason we stopped doing this. There should be some discussion about that in my original patch. | |
2036 | Can we abstract this in a function similar to getRedirectedFileStatus, something like getOrginialFileStatus or getNonCanonicalizedFileStatus and have a comment explaining what it does and why? |
llvm/lib/Support/VirtualFileSystem.cpp | ||
---|---|---|
1182–1183 | So it sounds like it was related to this:
But if I remove that 2 of my new tests ReturnsInternalPathVFSHit and ReturnsExternalPathVFSHit do not pass. I think the behavior of them is what we want, what do you think? |
llvm/lib/Support/VirtualFileSystem.cpp | ||
---|---|---|
1182–1183 | We stopped doing this because it puts ExternalFS in an unknown state since Path may not exist there. Future calls with relative paths could do very strange things. E.g., here's a simple testcase that demonstrates something going very wrong:
The correct result is for the stat to fail because /2/a doesn't exist. But your change above will instead find /1/a in ExternalFS. Another example:
External FS will have CWD of /1, redirecting will have CWD of /, and stat a will erroneously give the result for /1/a instead of /a. (Probably it'd be good to add tests for cases like this...) To safely call ExternalFS->setCurrentWorkingDirectory(), you'll need extra state (and checks anywhere ExternalFS is used) tracking whether it has a valid working directory. If it does not, then it should not be queried, or it should only be sent absolute paths, or (etc.); and there should also be a way to re-establish a valid working directory. |
Remove cd of ExternalFS
llvm/lib/Support/VirtualFileSystem.cpp | ||
---|---|---|
1182–1183 | Ok I definitely understand the use case now. This is probably something we should add tests for I guess, since I didn't seem to break anything with all my changes here. I've updated the logic here, the core issue my new tests were failing without this is because the redirect from the VFS that is returned is not canonicalized itself, meaning when you asked for vfsname you would get vfsmappedname back, instead of //absolute/path/to/vfsmappedname. Since we stopped doing this cd, that wasn't enough. With my new change here we're now canonicalizing the return path as well, which is canonicalized against the working directory of the VFS itself. The one thing I'm unusure about here, is why this isn't being done by the values returned from the VFS instead. I've added a new test here VFSFromYAMLTest.RelativePathHitWithoutCWD that illustrates the behavior I'm talking about. Requesting the absolute file path still fails because my canonicalization of the remapped path is incorrect, and it should be based on the directory's root instead of the VFS's PWD. |
Fix test paths on windows
Otherwise this resulted in a json string with non-escaped backslashes.
llvm/lib/Support/VirtualFileSystem.cpp | ||
---|---|---|
1182–1183 |
It'd be awesome if you could do that if you're up for it, while your head is in this... (maybe in parallel with or after this patch?)
I'm sorry, I'm not quite following; not sure if I'm thinking slowly today or if it's just that this stuff is complicated :). Can you add an inline comment in context to the testcase in question (just on Phab) that explains which behaviour/lines of code/etc. are unexpected/why/etc? |
llvm/unittests/Support/VirtualFileSystemTest.cpp | ||
---|---|---|
1738 | So this test case is actually failing. The difference between it and the others is I don't call FS->setCurrentWorkingDirectory("//root/foo"). This results in us (with my most recent change here) performing this logic:
It seems to me, without having a ton of context here, that the value returned from the VFS lookup should actually be //root/foo/realname, since otherwise we could likely hit one of the same issues as those discussed above where if you actually had this situation:
You'd end up with the wrong file, but I don't think this is actually new behavior with my change? @dexonsmith wdyt? |
Remove broken test for now
Turns out this fails on main as well, so I'll punt this to another change
@dexonsmith turns out the test I was adding is broken on main today too. If it's alright with you I will punt that to another diff to not increase the scope of this one.
Yes, SGTM.
LGTM! Thanks for seeing this through. (And thanks for your patience; this (and the pings) just got buried somehow in my inbox.)
llvm/unittests/Support/VirtualFileSystemTest.cpp | ||
---|---|---|
1738 | I agree with your analysis of the correct behaviour. Until the first call to setCurrentWorkingDirectory(), it seems like the working directory should implicitly be the one in the underlying FS (not sure whether it should be captured on construction, or just used going forward). |
I'm pretty sure there was a reason we stopped doing this. There should be some discussion about that in my original patch.