See a full discussion of this patch starting at http://lists.llvm.org/pipermail/cfe-dev/2016-April/048298.html.
Details
Diff Detail
Event Timeline
lib/Support/Unix/Path.inc | ||
---|---|---|
522 | I'd go for an early return to reduce indentation. | |
524–547 | Please remember to remove code which is disabled if you check it in. | |
554–556 | This should be #ifdef __linux__ | |
557 | I wouldn't bother with this, it has a non-trivial amount of overhead on non-X86 platforms. I'd just fall-back to ::realpath if ::readlink failed. In fact, why not fall back to ::realpath regardless of platform? | |
560–561 | I think it'd be nicer to build a twine like so: SmallString<32> ProcPathStorage; StringRef ProcPath = Twine("/proc/self/fd/").concat(ResultFD).toNullTerminatedStringRef(ProcPathStorage); | |
563–565 | What's the contract on failure? Should RealPath be cleared? | |
lib/Support/Windows/Path.inc | ||
701 | Could we use FILE_NAME_NORMALIZED instead of 0? | |
702 | Seeing as how we support extended-length paths, it'd be nice to continue supporting them here. Perhaps we could retry similarly to what we do in Process::GetEnv ? | |
704 | I'd use SmallString<MAX_PATH> RealPathUTF8; |
lib/Support/Unix/Path.inc | ||
---|---|---|
522 | OK. | |
524–547 | Roger. I included it only because an earlier reviewer asked that I try it, and also in the hopes that someone can tell me how to make fgetattrlist do something useful. This API is horrible and I'll delete it gladly. | |
554–556 | /proc works elsewhere, like cygwin for instance. | |
557 | Since HasProc is static const, this check is executed exactly once. The result can be used on all file open operations besides the first. This code tries to get the real name, but it doesn't try hard. It quickly determines one method to try, and if that fails it gives up. The reason is because (a) the real path is used only to issue a warning, (b) getting the real path (including case) is unreliable on just about all platforms, and realpath is the most unreliable of all the APIs I've tested. As a result, false negatives for this diagnostic are business as usual, and (c) I don't want to slow down file open. | |
560–561 | Serious question: why is that nicer? | |
563–565 | RealPath is cleared on line 523. | |
lib/Support/Windows/Path.inc | ||
701 | OK. | |
702 | I had retry logic but I removed it. Like I say above, I try to get the real filename, but I don't try hard in the interest of keeping file open snappy. | |
704 | OK. |
lib/Support/Unix/Path.inc | ||
---|---|---|
560–561 | snprintf engages a ton of machinery behind the scenes, twine is much lighter weight. |
Hi Eric,
Thanks for working on this. :-)
Can you split the functionality that gets the RealPath into a new function (something like llvm::sys::fs::canonical() or llvm::sys::fs::realpath()) and make openFileForRead calls it? Ideally, this function should have an optional FD argument that can be used by openFileForRead or any other client in order to obtain the fast path for obtaining the real path, but should also contain the implementation for the slow path (in case FD isn't available).
This is going to be useful for a bunch of other client in clang that are now obtaining realpath in a adhoc/slow ways.
Can you split the functionality that gets the RealPath into a new function (something like llvm::sys::fs::canonical() or llvm::sys::fs::realpath()) and make openFileForRead calls it?
[...]
This is going to be useful for a bunch of other client in clang that are now obtaining realpath in a adhoc/slow ways.
I hesitate to do that. The code in openFileForRead is less reusable than it appears. It prefers speed over correctness since opening files for reading is a hot path. If we make it a separate dedicated API, clients will have a reasonable expectation that it returns the correct results. Making it correct necessarily means making it slow, which means it would no longer be useable in openFileForRead.
The only way I know to make realpath 100% correct 100% of the time is to readdir starting from the drive root, enumerating all the subdirectories to find one that matches the next path component, and recursing all the way up. I think I'm safe in saying that making roughly a bajillion API calls that hit the disk over and over is going to be a lot slower than making just one and hoping for the best. :-)
If you can think of a better way to make this correct 100% of the time on all the platforms LLVM supports, I'm all ears.
clang has been using realpath in the FileManager for a long time and it seems to be working fine, I'm not following why it isn't 100% correct. In the VFS, as far as the needs go, it has been used to detect symlink in the components and work as expected. If we have an API in the lines of llvm::sys::fs::canonical(StringRef Filename, SmallVectorImpl<char > &path, int FD = -1), where we could optionally do the fast lookup in case FD > 0 (or any platform independent semantic) it's a matter of specifying the API contract what's the expected behaviour.
clang has been using realpath in the FileManager for a long time and it seems to be working fine, I'm not following why it isn't 100% correct. In the VFS, as far as the needs go, it has been used to detect symlink in the components and work as expected.
Ah, we're talking about 2 different things. You want to: (a) get the absolute path(?), (b) resolve symlinks, and (c) eliminate "." and ".." components from the path. That is precisely realpath's purpose. I want to get the proper case for a path. realpath doesn't (necessarily) do that, so I have to fart around doing platform-specific-y things that are likely to succeed. I have no data to suggest that those platform-specific-y things are better at realpath's job than realpath is.
If I factored this code out into an API, I wouldn't call it canonicalize or realpath; I'd call it something like try_get_real_case or some such. Do you think something like that would be generally useful?
Hi Eric,
Sorry for the delay
Ah, we're talking about 2 different things. You want to: (a) get the absolute path(?), (b) resolve symlinks, and (c) eliminate "." and ".." components from the path. That is precisely realpath's purpose.
Right.
I want to get the proper case for a path. realpath doesn't (necessarily) do that, so I have to fart around doing platform-specific-y things that are likely to succeed. I have no data to suggest that those platform-specific-y things are better at realpath's job than realpath is.
What I'm suggesting here is that you abstract everything that guarantees returning the realpath (as in absolute, etc) into a separate function, lets call it canonical_path, even if that means that the only reliable option is calling the realpath function itself. Since on windows it's a different function, I see much value in having that abstracted out into a OS independent function. Additionally, I believe some of the magic you are current doing could be used as a fast path to obtain the absolute path as well, right? The more you could use that magic in canonical_path the better, It would be awesome to have fast paths for that.
If I factored this code out into an API, I wouldn't call it canonicalize or realpath; I'd call it something like try_get_real_case or some such. Do you think something like that would be generally useful?
Something along the lines of a getPathRealCase returning a bool seems fine to me. This function can do any necessary magic but also call canonical_path whenever it needs OS independent realpath fallback. Does that make sense?
lib/Support/Unix/Path.inc | ||
---|---|---|
522 | space between "if" and "(" |
What I'm suggesting here is that you abstract everything that guarantees returning the realpath (as in absolute, etc) into a separate function, lets call it canonical_path, even if that means that the only reliable option is calling the realpath function itself. Since on windows it's a different function, I see much value in having that abstracted out into a OS independent function.
This makes sense.
Additionally, I believe some of the magic you are current doing could be used as a fast path to obtain the absolute path as well, right?
I haven't benchmarked that. I'm guessing that querying /proc is not going to be faster than realpath. F_GETPATH might be, but that's speculation at this point.
This function can do any necessary magic but also call canonical_path whenever it needs OS independent realpath fallback. Does that make sense?
Actually, I'm unlikely to use any new canonical_path from openFileForRead. For a new canonical_path API, it'll actually need to get the answer right. That means fallbacks, retry logic, and long path handling. That's not appropriate for openFileForRead, whose job is primarily to open a file. It makes an opportunistic best-effort to get the real-case-name, and gives up quickly so as not to slow down the preprocessor.
So, I could implement a portable canonical_path API in this diff, but it won't be used from anywhere. That's why I've been resisting making this change. Just making sure that's understood. All that being said, I am happy to add it along with some tests if it means forward progress.
lib/Support/Unix/Path.inc | ||
---|---|---|
522 | Yup, thx. |
I agree, definitely not the place. After re-reading your patch I also agree that introducing canonical_path doesn't help much. However, it's very convenient to have bool getPathFromOpenFD(StringRef Filename, SmallVectorImpl<char> *RealPath) or something along these lines. The reason is that we can then re-use the logic if we find appropriate when implementing canonical_path. How does that sound?
So, I could implement a portable canonical_path API in this diff, but it won't be used from anywhere. That's why I've been resisting making this change. Just making sure that's understood. All that being said, I am happy to add it along with some tests if it means forward progress.
Don't bother :-)
We need tests for this, can you add some? Ideally unittests (somewhere in unittests/Support).
Hi Eric,
LGTM with the comments below.
lib/Support/Unix/Path.inc | ||
---|---|---|
553 | I would rather just have this logic inlined in the call site below. | |
554 | *can | |
556 | It's more common in LLVM to use a integer literal in the RHS of the comparison, can you change this here and elsewhere in the patch? | |
lib/Support/Windows/Path.inc | ||
831 | Ditto | |
837 | Ditto | |
unittests/Support/Path.cpp | ||
999 | Very nice! Can you also add a testcase for openFileForRead ? |
lib/Support/Unix/Path.inc | ||
---|---|---|
553 | It's called from 2 places. Aside from DRY, putting the initialization of the static local in one plac means the "/proc" is only looked up once per process. | |
556 | I can do that, but in general I prefer putting the constant on the LHS in case I accidentally type = instead of == (although it doesn't matter here). |
lib/Support/Unix/Path.inc | ||
---|---|---|
553 | Nice, I missed that it was used more than once. Please make hasProcSelfFD() static as well and move it inside llvm::sys::fs. |
lib/Support/Unix/Path.inc | ||
---|---|---|
553 |
It's in an anonymous namespace. What is the reason for preferring static?
It's already there. |
lib/Support/Unix/Path.inc | ||
---|---|---|
553 | To be coherent with the rest of the file; we are already using static for other helpers, not anonymous namespaces. |
LGTM to re-commit this one. In case you need to revert the clang side of this patch once more, no need to revert this one (assuming the issue isn't here)
Please remember to remove code which is disabled if you check it in.