This is an archive of the discontinued LLVM Phabricator instance.

[clang] On Windows, ignore case and separators when discarding duplicate dependency file paths.
ClosedPublic

Authored by saudi on May 12 2021, 10:58 AM.

Details

Summary

We noticed that the dependency collectors of clang may report duplicate paths under Windows, when the differences are only in casing or separators (/ vs \)
This problem affects the correctness of the information output by clang-scan-deps, which in turn, affects our integration with the Fastbuild cache.

For separators, it can happen as clang internally computes the full path of a header from an #include directive using native separator.

Example with argument -I C:\includes:
#include <subdir/header.h> -> internally represented in the FileManager as C:\includes\subdir/header.h
#include "header.h" from another header in subdir folder -> C:\includes\subdir\header.h

The same applies with casing discrepancies that can be found sometimes in headers from external SDKs.

Diff Detail

Event Timeline

saudi requested review of this revision.May 12 2021, 10:58 AM
saudi created this revision.
rnk added inline comments.May 12 2021, 11:42 AM
clang/lib/Frontend/DependencyFile.cpp
145

I feel like this should somehow be a property of the input virtual filesystem: we should be able to ask the VFS to do the path canonicalization for us, or ask if the FS is case insensitive and do the lower-casing if so.

148

nit: using sys::path::native here seems nicer. It replaces ~ with the home dir, but that doesn't seem important.

clang/test/Frontend/dependency-gen-windows-duplicates.c
13

Please reorder the includes and adjust the test to look for SubDir/X.h with capitalization to verify that clang isn't simply producing all lower case output. Your code does this correctly, it canonicalizes the path to check for duplicates, and then prints the filename as written, I just want the test case to reflect that.

amccarth added inline comments.
clang/lib/Frontend/DependencyFile.cpp
145

Additional complication:

Windows NTFS partitions can have case-sensitivity enabled on a per-directory basis, so there is no single answer for the filesystem or even the host OS. Granted, this is not yet a commonly used feature.

I wonder about the (also rarely used) case of cross compiling for a Linux target on a Windows build machine.

149

If ::tolower is the same as std::tolower (e.g., if the header declared it in both the std and global namespaces), and if the locale is the default "C" locale, then this downcases just the 26 ASCII capital letters. Windows considers more characters when it's wearing its case-blinding glasses.
For example, the pre-composed U+00C1 LATIN CAPITAL LETTER A WITH ACUTE ACCENT and U+00E1 LATIN SMALL LETTER A WITH ACUTE ACCENT characters match in case-blind file names.

I'll concede that this is probably an existing problem elsewhere in LLVM, so it may not be a high enough priority to worry about right now. It just underscores the need for better centralized handling of paths and file names, so that we'll have a handle these sorts of problems if/when we ever accept that Windows is not Posix with backward slashes.

aganea added inline comments.May 12 2021, 3:15 PM
clang/lib/Frontend/DependencyFile.cpp
145

Windows NTFS partitions can have case-sensitivity enabled on a per-directory basis, so there is no single answer for the filesystem or even the host OS. Granted, this is not yet a commonly used feature.

I think case-sensitiveness was added for WSL, I'm not sure that feature would be used outside of that context. This would otherwise require quering the filesystem for the case of each directory component.

I wonder about the (also rarely used) case of cross compiling for a Linux target on a Windows build machine.

It's not that rare :) We do cross-compile all of our platforms from Windows, most of them are POSIX/Linux variants. This could become a problem although if we decided to share the dependency list accross systems, for example if we shared the cache artefacts (.OBJ files) between Windows and Linux bots. In which case a path renormalization would be necessary. The easiest solution for that purpose could be a case-insensitive FS on Linux and always execute the code that @saudi added in this patch. But that's not the goal right now.

We share obj files built on linux and on windows. So that's a goal for us.

We share obj files built on linux and on windows. So that's a goal for us.

Hello @thakis, I believe you don't use this codepath to generate the cache key, do you? I thought you had a custom python script to extract the dependencies?

saudi updated this revision to Diff 344947.May 13 2021, 7:04 AM

Updated for comments from @rnk, except for the canonicalization part.

Are you suggesting that the DependencyCollector should be called with the already canonicalized paths?

There are a few code paths leading to the DependencyCollector::addDependency, for some of them the access to the FS are not trivial to me, but I might just not have found the right angle to tackle this.

We share obj files built on linux and on windows. So that's a goal for us.

Hello @thakis, I believe you don't use this codepath to generate the cache key, do you? I thought you had a custom python script to extract the dependencies?

Oh, is this only used for -MD / -MMD? If so, yes, we don't use this on Windows atm. But someone's working on setting a win->linux cross build, and that would use this code path I think.

rnk accepted this revision.May 14 2021, 8:58 AM

lgtm

Ideally we should avoid the ifdef, but we don't have a good proxy for case sensitivity of the filesystem here.

This revision is now accepted and ready to land.May 14 2021, 8:58 AM
This revision was landed with ongoing or failed builds.May 17 2021, 7:34 AM
This revision was automatically updated to reflect the committed changes.
saudi marked 2 inline comments as done.