This is an archive of the discontinued LLVM Phabricator instance.

[Clang] Restore replace_path_prefix instead of startswith
ClosedPublic

Authored by saudi on Mar 26 2020, 10:40 AM.

Details

Summary

In D49466, sys::path::replace_path_prefix was used instead startswith.
However those were reverted later (commit rG3bb24bf25767ef5bbcef958b484e7a06d8689204) due to broken Windows tests.

This patch is to restore those replace_path_prefix calls. The objective, in a follow-up patch, is to allow the prefix matching to be case-insensitive under Windows.

This patch reverts commit rG3bb24bf25767ef5bbcef958b484e7a06d8689204 except for the test fix it contains.

Diff Detail

Event Timeline

saudi created this revision.Mar 26 2020, 10:40 AM
MaskRay added a comment.EditedMar 26 2020, 10:56 AM

Well C:\hello match both C:/hello and c:\hello but not C:\helloa? This behavior will look reasonable to me, though I should note that GCC does something similar to startswith.

Maybe we should tell them (https://gcc.gnu.org/pipermail/gcc/2020-March/) that we will match path components.

saudi added a comment.Mar 26 2020, 2:57 PM

Well C:\hello match both C:/hello and c:\hello but not C:\helloa? This behavior will look reasonable to me, though I should note that GCC does something similar to startswith.

Maybe we should tell them (https://gcc.gnu.org/pipermail/gcc/2020-March/) that we will match path components.

The examples you gave (C:\hello matching both C:/hello and c:\hello) is what I would like to do as a next step, in another patch, and would be only when compiling on Windows.

But indeed, this patch changes the behavior and matches entirely the last component for macro/debug prefix replacement : -fdebug-prefix-map=C:\hello=xxx would not replace C:\helloa anymore
This is because replace_path_prefix 's strict param is set to true in those cases.

Note that this was originally done in D49466 and got reverted because of the breaking Windows tests.
I restored it, but this change of behavior is not really required for the next step I was mentionning.

saudi updated this revision to Diff 253917.Mar 31 2020, 9:49 AM

Update patch, where I removed the change of behavior (not using the strict mode of replace_path_prefix).
We can change that behavior separately in a later commit.

I also fixed and added a unit test for replace_path_prefix, for a failing scenario discovered running the regression tests.

saudi updated this revision to Diff 253924.Mar 31 2020, 10:27 AM

Added checks for empty map to avoid unneeded string copies.

MaskRay accepted this revision.Mar 31 2020, 10:47 AM

LGTM.

clang/test/Preprocessor/file_test.c
10–12

Thanks for making win/non-win separated. This is clearer.

This revision is now accepted and ready to land.Mar 31 2020, 10:47 AM
saudi updated this revision to Diff 256290.Apr 9 2020, 7:40 AM

Update patch after the simplifications done in D77223.

I'm modifying this patch to also include step 3) as described in D77223 : make the Windows version of replace_path_prefix case and separator insensitive.

Without the separator-insensitive part, the regression test CHECK-REMOVE in file_test_windows.c would fail.
This is because the generated code contains both cases of separators after the prefix (\ the .c file, and / for the .h file).

saudi requested review of this revision.Apr 9 2020, 7:45 AM
saudi added a comment.Apr 16 2020, 6:42 AM

Ping!

I had to do some modifications after the patch was accepted, since a test was broken. See the comment with my last update.

saudi added a comment.EditedApr 28 2020, 11:39 AM

Ping - @MaskRay would you have time to take a quick look, please?

Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2020, 11:39 AM
MaskRay added inline comments.Apr 28 2020, 4:01 PM
clang/test/Preprocessor/file_test_windows.c
13

Can {{/|\\\\}} be simplified?

18

ditto

saudi added inline comments.Apr 29 2020, 7:26 AM
clang/test/Preprocessor/file_test_windows.c
13

Yes, however it would be /, which looks odd as compared to the line above that uses \ for the file_test_windows.c file.
This is because of how the include file full paths are resolved, where / are introduced.

Should I try to fix that behavior?
Or maybe I could leave this test with / instead of {{/|\\\\}}, leaving the coherence fix for a separate patch. What do you think?

MaskRay added inline comments.Apr 29 2020, 9:32 AM
clang/test/Preprocessor/file_test_windows.c
13

The coherence fix can be made in a separate patch. When that get fixes, the updated test will clearly show which has been change.

saudi updated this revision to Diff 260961.Apr 29 2020, 11:07 AM

rebase, and apply suggestions (removed regex matching in test file_test_windows.c test).

saudi marked 3 inline comments as done.Apr 29 2020, 12:25 PM
MaskRay added inline comments.
llvm/lib/Support/Path.cpp
510

I am sure toLower works when Unicode is considered.

@amccarth has made many useful comments in this area before, so I added him as a blocking reviewer.

510

s/am sure/am not sure/

MaskRay requested changes to this revision.Apr 29 2020, 1:30 PM
This revision now requires changes to proceed.Apr 29 2020, 1:30 PM

Background:

The llvm::sys::Path functions are designed to deal with paths that are either Posix-style or (a subset of) Windows styles. We also have the virtual file system (VFS), specifically the redirecting file system, which can truly be hybrids, mixing Posix with Windows styles in a single path.

Attempts to canonicalize VFS paths to one or the other wasn't feasible. It would require many tests to be duplicated for Posix or Windows. Since one of the benefits of VFS is to write single tests that work on both file systems, this didn't seem the right way to go. Furthermore, the general clang approach to file paths is not to canonicalize but to blindly concatenate.

I got nearly all of the VFS tests, and several others, working on Windows mostly by intentionally avoiding the llvm::sys::Path APIs when the path in question may be one of these hybrid styles.

I'll take a closer look at the details in this patch later today.

Thanks for including me in this discussion.

I think this can work and not interfere with hybrid VFS paths, but please make sure to test on the all VFS tests both in both LLVM and Clang.

The case blindness is incomplete as noted in the inline comments, but maybe it's "good enough." I don't know how strictly other parts of LLVM handle this. Doing better will be hard because you'd have to compare code point by code point rather than UTF-8 code unit by code unit. Also note that Windows allows for case-sensitive folders as an experimental feature. (NTFS actually is case sensitive, but the OS hides that detail.)

As noted inline, please add a path style parameter to replace_path_prefix and pass it along to starts_with.

llvm/lib/Support/Path.cpp
510

If I recall correctly, the case insensitivity in Windows paths is not locale-aware. In fact, NTFS has a hard-coded case table. So that's not an issue. Also not an issue is Unicode normalization. I'm pretty sure you can name one file "Ä" and another "Ä" and they will remain distinct because one is a pre-composited code point and the other is a regular "A" with a combining mark. So that's another thing you don't have to worry about.

But it looks like toLower only handles basic ASCII, so, yeah, Unicode path names won't benefit from this attempt at case-blind comparison. I'm not sure what facilities LLVM has for case folding for the rest of the BMP. Maybe this is good enough for now? I don't have any data on how common it is for people to expect proper case-blind comparisons of non-ASCII characters in Windows paths.

524

I wonder if replace_path_prefix should take a path style (defaulted to native) and pass it along to starts_with. That would give future users of replace_path_prefix the power to control whether they get the Windows treatment.

saudi updated this revision to Diff 262192.May 5 2020, 12:44 PM

Added Style parameter to replace_path_prefix, as suggested.
This allowed to make the replace_path_prefix unit-test platform independent, removing a #ifdef _WIN32 that I added in a previous version of the patch.

The tests ninja check-all passed on windows.
@amccarth : Does it cover all the VFS tests you suggested that I double-check?

I think this is pretty close now.

llvm/lib/Support/Path.cpp
510

I don't have an opinion on how to handle the Unicode/case issues. I don't know how important "perfect" case handling is or whether it should block "good enough" given that we don't have it know and this will help many. So I'm not going to block on that unless somebody with more info or stronger opinions chimes in.

[I just read that Windows 10 is now letting people experiment with case-sensitive portions of a file system. So who knows where this is going in the future!]

519

I like that this has the style parameter now. Thanks!

But naming it PrefixStyle seems odd. It suggests that it applies to all of the OldPrefix and NewPrefix but not the Path. Can you call it simply style to be consistent with most everything else in this path support file?

llvm/unittests/Support/Path.cpp
1342

No #ifdef in the test now. Yay!

@amccarth : Does it cover all the VFS tests you suggested that I double-check?

Yes. Thanks.

dankm added a comment.May 5 2020, 8:35 PM

I like the idea of this, and fits what I had in mind for the original D49466 review. I want to take a closer look, but right now LGTM.

saudi updated this revision to Diff 262368.May 6 2020, 7:03 AM

Renamed the argument PrefixStyle to style.

Not that it matches the rest of other arguments declared in Path.h, however it doesn't follow the LLVM code guidelines.

replace_path_prefix is currently the only function there with CamelCase arguments OldPrefix and NewPrefix.
Should-I rename those to oldPrefix and newPrefix to be more homogenous with the rest of the file?

I suppose the other choice would be to rename the style parameter, e.g. Style St

saudi updated this revision to Diff 262371.May 6 2020, 7:06 AM

(missed rename PrefixStyle to style in comment)

amccarth accepted this revision.May 6 2020, 8:14 AM

LGTM. Thanks.

I'm less worried about the details of style (like camel casing), since I have the impression things are changing. In general, stick to prevalent style of the file when it's discernable. My issue with PrefixStyle was that the name could be confusing.

saudi added a comment.May 6 2020, 8:37 AM

LGTM. Thanks.

I'm less worried about the details of style (like camel casing), since I have the impression things are changing. In general, stick to prevalent style of the file when it's discernable. My issue with PrefixStyle was that the name could be confusing.

Thanks!
@MaskRay, are you ok with this patch? I think it needs your approval, as it doesn't show as "Ready to land".

saudi marked 4 inline comments as done.May 12 2020, 8:48 AM
saudi added a comment.May 12 2020, 8:59 AM

Ping! @MaskRay you are a blocking reviewer, would you mind taking a final look please?

MaskRay accepted this revision.May 12 2020, 9:17 AM

Ah sorry for the delay. I just trust Adrian's resolution:)

This revision is now accepted and ready to land.May 12 2020, 9:17 AM
MaskRay added inline comments.May 12 2020, 9:18 AM
clang/lib/Lex/PPMacroExpansion.cpp
1545

Does this need to be swapped?

saudi marked an inline comment as done.May 12 2020, 11:31 AM
saudi added inline comments.
clang/lib/Lex/PPMacroExpansion.cpp
1545

Lexer::Stringify doubles backslashes in Windows-style paths.

So, without this patch, using the -fmacro-prefix-map=[X]:[Y] requires doubling backslashes for [X], whereas -fdebug-prefix-map= requires single backslashes.

Also, in the tests : %clang -E -fmacro-prefix-map=%p=A:\UNLIKELY_PATH=empty would never match %p prefix on Windows platform, as %p contains single blackslashes.