This is an archive of the discontinued LLVM Phabricator instance.

In openFileForRead, attempt to fetch the actual name of the file on disk -- including case -- so that clang can later warn about non-portable #include and #import directives.
ClosedPublic

Authored by eric_niebler on May 2 2016, 5:22 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

eric_niebler retitled this revision from to In openFileForRead, attempt to fetch the actual name of the file on disk -- including case -- so that clang can later warn about non-portable #include and #import directives..
eric_niebler updated this object.
eric_niebler added a reviewer: bruno.
eric_niebler added a subscriber: llvm-commits.
majnemer added inline comments.
lib/Support/Unix/Path.inc
522

I'd go for an early return to reduce indentation.

524–547

Please remember to remove code which is disabled if you check it in.

554–556

This should be #ifdef __linux__

557

I wouldn't bother with this, it has a non-trivial amount of overhead on non-X86 platforms. I'd just fall-back to ::realpath if ::readlink failed.

In fact, why not fall back to ::realpath regardless of platform?

560–561

I think it'd be nicer to build a twine like so:

SmallString<32> ProcPathStorage;
StringRef ProcPath = Twine("/proc/self/fd/").concat(ResultFD).toNullTerminatedStringRef(ProcPathStorage);
563–565

What's the contract on failure? Should RealPath be cleared?

lib/Support/Windows/Path.inc
701

Could we use FILE_NAME_NORMALIZED instead of 0?

702

Seeing as how we support extended-length paths, it'd be nice to continue supporting them here. Perhaps we could retry similarly to what we do in Process::GetEnv ?

704

I'd use SmallString<MAX_PATH> RealPathUTF8;

eric_niebler added inline comments.May 3 2016, 5:05 PM
lib/Support/Unix/Path.inc
522

OK.

524–547

Roger. I included it only because an earlier reviewer asked that I try it, and also in the hopes that someone can tell me how to make fgetattrlist do something useful. This API is horrible and I'll delete it gladly.

554–556

/proc works elsewhere, like cygwin for instance.

557

Since HasProc is static const, this check is executed exactly once. The result can be used on all file open operations besides the first.

This code tries to get the real name, but it doesn't try hard. It quickly determines one method to try, and if that fails it gives up. The reason is because (a) the real path is used only to issue a warning, (b) getting the real path (including case) is unreliable on just about all platforms, and realpath is the most unreliable of all the APIs I've tested. As a result, false negatives for this diagnostic are business as usual, and (c) I don't want to slow down file open.

560–561

Serious question: why is that nicer?

563–565

RealPath is cleared on line 523.

lib/Support/Windows/Path.inc
701

OK.

702

I had retry logic but I removed it. Like I say above, I try to get the real filename, but I don't try hard in the interest of keeping file open snappy.

704

OK.

majnemer added inline comments.May 3 2016, 5:11 PM
lib/Support/Unix/Path.inc
560–561

snprintf engages a ton of machinery behind the scenes, twine is much lighter weight.

Review feedback

bruno edited edge metadata.May 10 2016, 10:00 PM

Hi Eric,

Thanks for working on this. :-)

Can you split the functionality that gets the RealPath into a new function (something like llvm::sys::fs::canonical() or llvm::sys::fs::realpath()) and make openFileForRead calls it? Ideally, this function should have an optional FD argument that can be used by openFileForRead or any other client in order to obtain the fast path for obtaining the real path, but should also contain the implementation for the slow path (in case FD isn't available).

This is going to be useful for a bunch of other client in clang that are now obtaining realpath in a adhoc/slow ways.

Can you split the functionality that gets the RealPath into a new function (something like llvm::sys::fs::canonical() or llvm::sys::fs::realpath()) and make openFileForRead calls it?
[...]
This is going to be useful for a bunch of other client in clang that are now obtaining realpath in a adhoc/slow ways.

I hesitate to do that. The code in openFileForRead is less reusable than it appears. It prefers speed over correctness since opening files for reading is a hot path. If we make it a separate dedicated API, clients will have a reasonable expectation that it returns the correct results. Making it correct necessarily means making it slow, which means it would no longer be useable in openFileForRead.

Could you please measure the slowdown? It'd help evaluate this decision.

The only way I know to make realpath 100% correct 100% of the time is to readdir starting from the drive root, enumerating all the subdirectories to find one that matches the next path component, and recursing all the way up. I think I'm safe in saying that making roughly a bajillion API calls that hit the disk over and over is going to be a lot slower than making just one and hoping for the best. :-)

If you can think of a better way to make this correct 100% of the time on all the platforms LLVM supports, I'm all ears.

bruno added a comment.May 12 2016, 1:19 PM

The only way I know to make realpath 100% correct 100% of the time is to readdir starting from the drive root, enumerating all the subdirectories to find one that matches the next path component, and recursing all the way up. I think I'm safe in saying that making roughly a bajillion API calls that hit the disk over and over is going to be a lot slower than making just one and hoping for the best. :-)

If you can think of a better way to make this correct 100% of the time on all the platforms LLVM supports, I'm all ears.

clang has been using realpath in the FileManager for a long time and it seems to be working fine, I'm not following why it isn't 100% correct. In the VFS, as far as the needs go, it has been used to detect symlink in the components and work as expected. If we have an API in the lines of llvm::sys::fs::canonical(StringRef Filename, SmallVectorImpl<char > &path, int FD = -1), where we could optionally do the fast lookup in case FD > 0 (or any platform independent semantic) it's a matter of specifying the API contract what's the expected behaviour.

clang has been using realpath in the FileManager for a long time and it seems to be working fine, I'm not following why it isn't 100% correct. In the VFS, as far as the needs go, it has been used to detect symlink in the components and work as expected.

Ah, we're talking about 2 different things. You want to: (a) get the absolute path(?), (b) resolve symlinks, and (c) eliminate "." and ".." components from the path. That is precisely realpath's purpose. I want to get the proper case for a path. realpath doesn't (necessarily) do that, so I have to fart around doing platform-specific-y things that are likely to succeed. I have no data to suggest that those platform-specific-y things are better at realpath's job than realpath is.

If I factored this code out into an API, I wouldn't call it canonicalize or realpath; I'd call it something like try_get_real_case or some such. Do you think something like that would be generally useful?

Any more comments on this one? @majnemer? @bruno?

bruno added a comment.May 19 2016, 5:56 PM

Hi Eric,

Sorry for the delay

Ah, we're talking about 2 different things. You want to: (a) get the absolute path(?), (b) resolve symlinks, and (c) eliminate "." and ".." components from the path. That is precisely realpath's purpose.

Right.

I want to get the proper case for a path. realpath doesn't (necessarily) do that, so I have to fart around doing platform-specific-y things that are likely to succeed. I have no data to suggest that those platform-specific-y things are better at realpath's job than realpath is.

What I'm suggesting here is that you abstract everything that guarantees returning the realpath (as in absolute, etc) into a separate function, lets call it canonical_path, even if that means that the only reliable option is calling the realpath function itself. Since on windows it's a different function, I see much value in having that abstracted out into a OS independent function. Additionally, I believe some of the magic you are current doing could be used as a fast path to obtain the absolute path as well, right? The more you could use that magic in canonical_path the better, It would be awesome to have fast paths for that.

If I factored this code out into an API, I wouldn't call it canonicalize or realpath; I'd call it something like try_get_real_case or some such. Do you think something like that would be generally useful?

Something along the lines of a getPathRealCase returning a bool seems fine to me. This function can do any necessary magic but also call canonical_path whenever it needs OS independent realpath fallback. Does that make sense?

lib/Support/Unix/Path.inc
522

space between "if" and "("

What I'm suggesting here is that you abstract everything that guarantees returning the realpath (as in absolute, etc) into a separate function, lets call it canonical_path, even if that means that the only reliable option is calling the realpath function itself. Since on windows it's a different function, I see much value in having that abstracted out into a OS independent function.

This makes sense.

Additionally, I believe some of the magic you are current doing could be used as a fast path to obtain the absolute path as well, right?

I haven't benchmarked that. I'm guessing that querying /proc is not going to be faster than realpath. F_GETPATH might be, but that's speculation at this point.

This function can do any necessary magic but also call canonical_path whenever it needs OS independent realpath fallback. Does that make sense?

Actually, I'm unlikely to use any new canonical_path from openFileForRead. For a new canonical_path API, it'll actually need to get the answer right. That means fallbacks, retry logic, and long path handling. That's not appropriate for openFileForRead, whose job is primarily to open a file. It makes an opportunistic best-effort to get the real-case-name, and gives up quickly so as not to slow down the preprocessor.

So, I could implement a portable canonical_path API in this diff, but it won't be used from anywhere. That's why I've been resisting making this change. Just making sure that's understood. All that being said, I am happy to add it along with some tests if it means forward progress.

lib/Support/Unix/Path.inc
522

Yup, thx.

bruno added a comment.May 23 2016, 5:46 PM

What I'm suggesting here is that you abstract everything that guarantees returning the realpath (as in absolute, etc) into a separate function, lets call it canonical_path, even if that means that the only reliable option is calling the realpath function itself. Since on windows it's a different function, I see much value in having that abstracted out into a OS independent function.

This makes sense.

Additionally, I believe some of the magic you are current doing could be used as a fast path to obtain the absolute path as well, right?

I haven't benchmarked that. I'm guessing that querying /proc is not going to be faster than realpath. F_GETPATH might be, but that's speculation at this point.

This function can do any necessary magic but also call canonical_path whenever it needs OS independent realpath fallback. Does that make sense?

Actually, I'm unlikely to use any new canonical_path from openFileForRead. For a new canonical_path API, it'll actually need to get the answer right. That means fallbacks, retry logic, and long path handling. That's not appropriate for openFileForRead, whose job is primarily to open a file. It makes an opportunistic best-effort to get the real-case-name, and gives up quickly so as not to slow down the preprocessor.

I agree, definitely not the place. After re-reading your patch I also agree that introducing canonical_path doesn't help much. However, it's very convenient to have bool getPathFromOpenFD(StringRef Filename, SmallVectorImpl<char> *RealPath) or something along these lines. The reason is that we can then re-use the logic if we find appropriate when implementing canonical_path. How does that sound?

So, I could implement a portable canonical_path API in this diff, but it won't be used from anywhere. That's why I've been resisting making this change. Just making sure that's understood. All that being said, I am happy to add it along with some tests if it means forward progress.

Don't bother :-)

We need tests for this, can you add some? Ideally unittests (somewhere in unittests/Support).

eric_niebler edited edge metadata.

Add getPathFromOpenFD and unittest, as requested.

bruno added a comment.May 31 2016, 6:37 PM

Hi Eric,

LGTM with the comments below.

lib/Support/Unix/Path.inc
553

I would rather just have this logic inlined in the call site below.

554

*can

556

It's more common in LLVM to use a integer literal in the RHS of the comparison, can you change this here and elsewhere in the patch?

lib/Support/Windows/Path.inc
831

Ditto

837

Ditto

unittests/Support/Path.cpp
999 ↗(On Diff #58856)

Very nice! Can you also add a testcase for openFileForRead ?

eric_niebler added inline comments.Jun 2 2016, 3:53 PM
lib/Support/Unix/Path.inc
553

It's called from 2 places. Aside from DRY, putting the initialization of the static local in one plac means the "/proc" is only looked up once per process.

556

I can do that, but in general I prefer putting the constant on the LHS in case I accidentally type = instead of == (although it doesn't matter here).

Add unittest for openFileForRead; assorted stylistic tweaks.

bruno added inline comments.Jun 2 2016, 5:10 PM
lib/Support/Unix/Path.inc
553

Nice, I missed that it was used more than once. Please make hasProcSelfFD() static as well and move it inside llvm::sys::fs.

eric_niebler added inline comments.Jun 3 2016, 9:09 AM
lib/Support/Unix/Path.inc
553

Please make hasProcSelfFD() static

It's in an anonymous namespace. What is the reason for preferring static?

and move it inside llvm::sys::fs

It's already there.

bruno added inline comments.Jun 3 2016, 9:38 AM
lib/Support/Unix/Path.inc
553

To be coherent with the rest of the file; we are already using static for other helpers, not anonymous namespaces.

make hasProcSelfFD() static

bruno accepted this revision.Jun 3 2016, 10:51 AM
bruno edited edge metadata.

LGTM!

This revision is now accepted and ready to land.Jun 3 2016, 10:51 AM
twoh closed this revision.Jun 3 2016, 12:29 PM
twoh added a subscriber: twoh.

I have commit in r271704: http://reviews.llvm.org/rL271704

twoh added a comment.Jun 3 2016, 8:46 PM

Reverted in r271764.

eric_niebler reopened this revision.Jun 8 2016, 10:15 AM

Reopening this and the updated D19843.

This revision is now accepted and ready to land.Jun 8 2016, 10:15 AM
bruno added a comment.Jun 8 2016, 12:02 PM

LGTM to re-commit this one. In case you need to revert the clang side of this patch once more, no need to revert this one (assuming the issue isn't here)