When on windows gnu-ar treats member names as case insensitive. This diff implements the same behaviour.
Details
Diff Detail
Event Timeline
I am probably not the best person to review it (I know little about llvm-ar code).
But anyways, my comments are below. (Generally looks good to me).
llvm/test/tools/llvm-ar/windows-case.test | ||
---|---|---|
1 ↗ | (On Diff #221784) | 'windows-case.test' sounds not so clear for me. Perhaps windows-name-case.test would be better? Do we have something like this test for non-windows, btw? Would be probably nice to show the behavior |
18 ↗ | (On Diff #221784) | T is create a thin archive, why do you need it here? (the test passes without it). |
23 ↗ | (On Diff #221784) | The same. |
llvm/tools/llvm-ar/llvm-ar.cpp | ||
494 | You do not need wrapping this piece into curly bracers. |
llvm/test/tools/llvm-ar/windows-case.test | ||
---|---|---|
6 ↗ | (On Diff #221784) | Can you check the contents after this step too to make sure the filename case is preserved, and the normalization is only used for comparison? |
14 ↗ | (On Diff #221784) | Can you filecheck w/ the actual contents (FileCheck --input-file=%t/archive.a) instead of using llvm-ar t? It would be good to assert the file contents are replaced too. |
llvm/tools/llvm-ar/llvm-ar.cpp | ||
492–493 | nit: llvm::transform(NormalizedPath, NormalizedPath.begin(), ::toupper); |
llvm/test/tools/llvm-ar/windows-case.test | ||
---|---|---|
18 ↗ | (On Diff #221784) | This was to specifically test thin archive members as they are stored as paths while the "regular archive" members just use a file name. The code path is similar for both cases however, would you prefer it being removed? |
llvm/test/tools/llvm-ar/windows-case.test | ||
---|---|---|
18 ↗ | (On Diff #221784) | Probably not. I think I've just misunderstoold how this option works together with others. LG. |
I hope we can avoid case insensitiveness whenever possible, so can you talk about your use case?
// binutils-gdb/include/filenames.h #if defined(__MSDOS__) || defined(_WIN32) || defined(__OS2__) || defined (__CYGWIN__) # ifndef HAVE_DOS_BASED_FILE_SYSTEM # define HAVE_DOS_BASED_FILE_SYSTEM 1 # endif # ifndef HAVE_CASE_INSENSITIVE_FILE_SYSTEM # define HAVE_CASE_INSENSITIVE_FILE_SYSTEM 1 # endif // libiberty/filename_cmp.c int filename_cmp (const char *s1, const char *s2) { #if !defined(HAVE_DOS_BASED_FILE_SYSTEM) \ && !defined(HAVE_CASE_INSENSITIVE_FILE_SYSTEM) return strcmp(s1, s2); #else for (;;) { int c1 = *s1; int c2 = *s2; #if defined (HAVE_CASE_INSENSITIVE_FILE_SYSTEM) c1 = TOLOWER (c1); ///// POSIX/C locale case mapping, locale agnostic c2 = TOLOWER (c2); #endif ...
Quoting Per-directory case sensitivity and WSL:
Applications can pass the FILE_FLAG_POSIX_SEMANTICS flag to the CreateFile API to indicate that they want the path to be treated as case sensitive.
My understanding is that although NTFS is case sensitive, the windows API makes file operations case insensitive by default.
This means on Windows the two paths below are functionally the same:
- c:\a\path\example
- C:\A\PATH\EXAMPLE
As your snippet shows, gnu-ar treats these paths as the same when on Windows when doing member name comparisons e.g. when using the replace command one path will replace the other.
Under current llvm-ar behaviour they are considered as two different paths. If path 1 is stored in an archive, you can not select the member using path 2.
This diff was put up for review to keep behaviour consistent with gnu-ar and with default windows path behaviour.
llvm/test/tools/llvm-ar/windows-name-case.test | ||
---|---|---|
10–15 | The first archive should be checked that it doesn't contain an uppercase name (would be good to check if gnu-ar does this on windows): # RUN: llvm-ar rc %t/archive.a %t/lowerCase/file.txt # RUN: FileCheck %s -input-file=%t/archive.a --check-prefix=LOWER --implicit-check-not=FILE.TXT # RUN: llvm-ar rc %t/archive.a %t/UPPERCASE/FILE.TXT # RUN: FileCheck %s -input-file=%t/archive.a --check-prefix=UPPER --implicit-check-not=file.txt # LOWER: file.txt # UPPER: FILE.TXT |
Agreed.
This diff was put up for review to keep behaviour consistent with gnu-ar and with default windows path behaviour.
I asked because I want to make sure this patch is put up for practical use, not for a compatibility corner case that may rarely matter. I think most people may realize case insensitiveness on file systems is a bad idea. If my understanding is correct, llvm-lib used by Windows. llvm-ar is an ELF tool and it should match a generic ELF platform as close as possible and such platform disparity should be as little as possible.
Thanks for the quick response. I agree that case insensitiveness on file systems is a bad idea, however the reality is that Windows has case a case insensitive file system (at least by default) . To be clear our customers will always be using llvm-ar on a Windows host along side the other ELF tools they are using to cross compile to another target. The case I outlined above could occur due to human error or presumption that the Windows path semantics would extend not just to when accessing the file system but when manipulating those members.
llvm-ar normalises relative paths in order to properly compare them rather than using a naive string comparison. This is due to thin archive members representing file paths, and as such different paths can point to the same file. In the case of Windows this also applies to paths of different cases, in the case above those paths represent the same file in the Windows file system and I think llvm-ar should treat them as such.
Also llvm-ar already differs in behaviour depending on host platform, the main one being the default format of a newly created archive. I don't think this behaviour difference would surprise anyone using a Windows machine, particularly if they have used ar on Windows in the past which we know follows this behaviour.
llvm/test/tools/llvm-ar/windows-name-case.test | ||
---|---|---|
10–15 | I've check and this is the case with gnu-ar on Windows. |
While this probably can be argued to be true, relying on the opposite (being able to store files that differ only in case, like in the linux kernel) also is best avoided if possible, especially if you're going to build the codebase on some other platform than the target.
If my understanding is correct, llvm-lib used by Windows. llvm-ar is an ELF tool and it should match a generic ELF platform as close as possible and such platform disparity should be as little as possible.
Not quite - llvm-lib is used as drop-in replacement for MSVC lib.exe, but for MinGW cases (where you normally use binutils tools) you'd use (llvm-)ar for COFF objects as well. (And llvm-ar is used for MachO as well, not only ELF/COFF.)
FWIW, this particular host platform specific behaviour should in practice be minimal. If possible, llvm-ar inspects the input object files (or existing archive) and decides the default format based on that. The effect is that you get a darwin-style static library if you run llvm-ar on darwin, only if you create a new empty archive where there's no object files to decide from.
Regarding this patch, I'm ok with it (concept-wise, I haven't checked the implementation yet), as building code which relies on distinguishing between filenames that only differ in case, on windows, already in itself is problematic. And as there's predecent for it in GNU binutils (which can be used for cross compiling to ELF platforms, from windows), I think we should do this.
I sympathize MaskRay, but in practice I think we want this because on Windows filenames are case-insensitive. That is what users expect, and I personally don't care too much about cases when passing a filename to a command. If (only) ar distinguishes pathnames in different cases, I'd be surprised.
llvm/tools/llvm-ar/llvm-ar.cpp | ||
---|---|---|
491 | Instead of detecting the OS at runtime, you should use #ifdef _WIN32 because once you compile this for Windows, the run-time OS must be Windows. This new code needs a comment justifying why we are doing this. | |
492 | This is perhaps my personal preference, but I prefer making it lowercase rather than uppercase when normalizing. |
llvm/tools/llvm-ar/llvm-ar.cpp | ||
---|---|---|
492 | This may need to use POSIX/C locale llvm::toUpper from llvm/ADT/StringExtras.h if my reading of binutils source is correct. A manual test that involves non-ASCII Unicode is needed. |
Actually, don't you have to use the case conversion as defined by the Unicode standard for international users whose language has the notion of uppercase/lowercase and has characters outside ASCII?
NTFS and the layers above it don't normalize filenames (in the Unicode sense) and I believe that path names are stored as sequences of 16bit integers, without any guarantee of being well-formed UCS-2/UTF_16. So I don't think that a comparison via locale-aware case folding would be correct for paths. I don't have sources for this, it's just what I recall from having done experiments some time ago.
Use of CompareStringOrdinal() is suggested by https://docs.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications
Generally looking good.
llvm/tools/llvm-ar/llvm-ar.cpp | ||
---|---|---|
496 | I'd add a function comment here to explain what this function does for Windows. On Windows, we want a case-insensitive comparison as defined by the Unicode standard (which is I believe what CompareStringOrdinal implements), so that filenames in archives are compared case-insensitive manner. | |
496 | nit: remove const from Path2. | |
497 | nit: use ifdef and swap the contents of then and else clauses |
Not going to review the code (I'll leave others to do that), but a quick reminder to please include the full context when uploading diffs.
Thinking about the case-sensitivity issue and after chatting with another colleague, it occurred to me that we should probably document the actual behaviour of case sensitivity in the llvm-ar documentation (not looked at the exact agreed-upon behaviour yet, but saying something like "filenames are written case-sensitively, but on Windows all filename comparisons are performed case-insensitively" or whatever the equivalent would be is probably a good idea). This is because it is not unheard of for users to cross-build an archive on Windows that will later be used on Linux.
Removed unneeded const, swapped use of ifndef and added comment to comparePaths function.
llvm/tools/llvm-ar/llvm-ar.cpp | ||
---|---|---|
496 | Use of CompareStringOrdinal it is not compliant to the Unicode standard as when string matching Windows file paths do not conform to the Unicode standard. I think this is fine though, using CompareStringOrdinal to mirror Windows behaviour makes more sense than a string comparison that conforms to the Unicode standard. |
Case insensitivity and platform differences do make me sad, but if people think it is the right thing to do on Windows I'll not insist.
Note, GNU ar uses POSIX/C locale case matching. If we want to use CompareStringOrdinal, we should probably create an issue on https://sourceware.org/bugzilla
#if defined (HAVE_CASE_INSENSITIVE_FILE_SYSTEM) c1 = TOLOWER (c1); c2 = TOLOWER (c2); #endif
As jhenderson suggested, could you add a description about this behavior? Looks like llvm-ar doesn't have a manual page, so the second best thing to add a description is probably the help message for --help.
Well I'm not happy about this change, but this is probably an unavoidable consequence of the decision that Microsoft made in the early 80s...
We have a command guide entry: http://llvm.org/docs/CommandGuide/llvm-ar.html
And it looks like that is also available (at least on debian-based systems) via man llvm-ar-8
+1 to documenting it there.
Well I'm not happy about this change, but this is probably an unavoidable consequence of the decision that Microsoft made in the early 80s...
:)
llvm/tools/llvm-ar/llvm-ar.cpp | ||
---|---|---|
510 | I'm not quite sure about this change... a few of the callsites before were Name == normalizePath(Path), not normalizePath(Name) == normalizePath(Path). My past experiences of compatibility testing llvm-ar vs GNU ar has largely been paged out, but I think this may have been one of the differences. It may actually be something we want, but we should test it. e.g. to test the performReadOperation can you see if extracting "foo/file.txt" will end up extracting "bar/file.txt" (in a situation where CompareFullPath is false)? |
Update llvm-ar command guide with case insensitivity details, and include a test for archived files with paths for names.
Due to updating the command guide I have not added the case insensitivity details to the llvm-ar help text. Would it be preferred to have these details in both? Also the test name "path-names" feels awkward, if anyone has a better suggestion I would be happy to change it.
llvm/tools/llvm-ar/llvm-ar.cpp | ||
---|---|---|
510 | You are correct. I have added a test for this behaviour which now matches that of gnu-ar. |
llvm/docs/CommandGuide/llvm-ar.rst | ||
---|---|---|
40–41 ↗ | (On Diff #224621) | Expressing Unix filesystems as case-sensitive is perhaps correct, but the more precise way of expressing it is that the filesystem layer doesn't have a notion of character case. |
41 ↗ | (On Diff #224621) | Looks like focusing on visible behavior makes this document easier to understand. From the user's point of view, the difference is this: If a new file is given to llvm-ar to add or replace an existing file, the filename of the new file is compared to each member of the archive. On Windows, this comparison is done in a case-insensitive manner (e.g. if you already have foo.o and add/replace FOO.o, it will replace the existing file). On other OSes, this comparison is done just by byte-by-byte comparison. |
44 ↗ | (On Diff #224621) | What does it actually mean by non-standard compliant? |
Updated llvm-ar doc to clarify non-windows behaviour and why the windows file system is not Unicode compliant.
llvm/docs/CommandGuide/llvm-ar.rst | ||
---|---|---|
41 ↗ | (On Diff #224621) | Has the newer diff clarified things? I'm hesitant to use a more detailed explanation with examples to clarify the behaviour surrounding character case, to me this subsection already seems to be escaping the scope of a command guide. Although I see the value in having this OS difference documented I don't think it requires much explanation, this behaviour matches Windows and should not cause any surprise to the Windows user. |
LGTM
llvm/docs/CommandGuide/llvm-ar.rst | ||
---|---|---|
41 ↗ | (On Diff #224621) | The new document looks much better! Thanks. I'm not sure if we have to mention the Unicode standard's case folding rules at this point -- I'd probably remove the sentences and merge the following paragraph into this paragraph. But it's up to you. |
The first archive should be checked that it doesn't contain an uppercase name (would be good to check if gnu-ar does this on windows):