This function can be used to guess separator style from an absolute
path. This is equivalent to lldb's GuessPathStyle. The primary use
case is DWARFDebugLine::Prologue::getFileNameByIndex where instead
of using native style, which may not match the style of the platform
where the debug info was generated, we should instead try to guess
the style from the path itself and then use it if at all possible.
There are other places in LLVM we implement this functionality and
could all use this function instead.
Details
- Reviewers
- dblaikie 
Diff Detail
Event Timeline
I had to revert D87657 because it broke several tests: the normalization resulted in slashes being converted to native format, but that's incorrect. I'd like to use this function to instead determine the style from the path inside the debug info.
Hmm - probably simpler to justify this patch by including refactoring at least one (possibly more) existing uses you mentioned, to use this API - then I probably don't really need to understand what it's doing, just that it's generalizing some existing functionality.
But if I am going to understand what this is doing - could you explain in a bit more detail (perhaps with examples) what went wrong with the other review/patch and how this'll address it? I wasn't quite able to follow the patch description, sorry :/
Done.
But if I am going to understand what this is doing - could you explain in a bit more detail (perhaps with examples) what went wrong with the other review/patch and how this'll address it? I wasn't quite able to follow the patch description, sorry :/
The problem is that GetFileByIndex has been using llvm::sys::path::native as the separator style which is the separator of the platform we're running on, but not the platform where the debug info was generated. So for example, say I produce an ELF file with debug info on Linux, but then try to use llvm-dwarfdump with that file on Windows. Today, I'd get /home/username/source\file.c as the path, assuming /home/username/source being comp_dir and file.c being the filename. That's because GetFileByIndex uses sys::path::append with llvm::sys::path::native as a style which on Windows would result in \ being used as separator instead of /. That's why tests like https://github.com/llvm/llvm-project/blob/c913f6dce69513b430f705d5a1f4e745f5d0a27e/lld/test/wasm/debuginfo.test#L19 use regex to match both separators for the last component. With D87657 this becomes worse because llvm::sys::path:remove_dots rewrites all separators as part of normalization, so you end up with /home\username\source\file.c. Rather than updating all tests to accepts both separators for all components, I made this change which tries to infer the correct separator which I think is a better solution overall: if I have a file produced on Linux, even if I'm reading it on Windows, I'd still expect / to be used as path separators. Does that make sense?
Potential alternative would be to always use llvm::sys::path::posix everywhere since on modern Windows, all syscalls and tools should accept / as a path separator, but I'm worried that's going to be a more intrusive change.
I'm not convinced this is really correct. After all, a path with mixed separators (e.g. '/my/path\to\foo') could be a Windows path rooted at the current drive. This mixed style can sometimes happen when things get concatenated together, so I don't think it's compeletely unreasonable.
I'm also concerned that if we guess wrong, the behaviour will end up breaking. For example, if the path were a Windows path "/my/path\../to/foo", I'd expect a remove_dots call (without worrying about slash normalisation) to result in "/my/to/foo", but if the code is using Posix style, will it actually result in the path being left unchanged?
I can think of two competing alternatives:
- add an option to remove_dots to not do the separator canonicalization. I'm not sure whether this can be done unambiguously however - which separator should be removed when a directory is removed due to dots? The one before the removed parts? After them?
- change all '\' to '/' unconditionally. However, this might break Linux paths with '\' in.
None of these solutions is perfect and I'm not even sure if there's one. Even if we improve remove_dots, there's still the existing issue with sys::path::append: which separator should we use when combining compilation dir, include dir and the file name.
Maybe we should instead update Clang to do the normalization when generating debug info and always use /, and then always use / unconditionally when combining paths on the consumer side (but don't do any normalization)?
You're probably right about no solution being perfect. I'm not a fan of using '/' on Windows, mostly because it looks weird, but also because at least in the past for Windows cmd, '/' paths didn't auto-complete properly. It's only a minor gripe though (and not really an issue nowadays since I mostly use PowerShell). Normalisation to at least remove dots would make a lot of sense (strings become shorter), and possibly even normalising the separator direction does too, since that would make it more likely paths end up being identical and therefore poolable in .debug_str/.debug_line_str.
sys::path::append could probably unconditionally be Posix, since such paths will also be valid Windows paths always. Maybe we need some way of communicating target platform as an alternative to native style? If that were a command-line option, the caller at least would know which style it needs to work with.
I'd like to hear others opinions - @dblaikie?
Not sure I have great enlightenment here. @rnk might have some opinions on Windows V Posix path separators.
I think it's probably good to do path normalization up in clang where it knows about real files and paths and what's what - if possible (well, maybe with line directives it doesn't know? I think it could be given arbitrary strings there, really). Rather than trying to guess about path names later.
Are there accessible /bugs/ in any of this? (cases where we currently (or with this patch) can produce incorrect behavior, not just somewhat less than ideal dumps? The PDB use of this string seems like it goes into the PDB debug info and is used by a debugger - but I guess in that case you're dealing with Windows where both path separators are acceptable, so no bugs - and is there no POSIX equivalent situation where guessing is done and a wrong guess results in a backslash that would make the path buggy/incorrect/)
A big portability problem with the virtual file system (VFS) is that we now have paths that can legitimately be in a hybrid Windows/Posix style. Be aware that some of the "guess-the-style" instances were modified to account for this.
It's _not_ true that you can _always_ substitute / for \ in Windows paths. It's true that most of the Win32 file handling APIs will accommodate non-native separators. But Windows has multiple varieties of file paths, like UNC paths, where the backslashes are meaningful. To bypass other Win32 API restrictions (like a relatively small MAX_PATH), many APIs now support a special prefix that tells the API to treat a path as an opaque sequence of wide chars that are passed down to the file system without alteration. Those, too need to preserve blackslashes. Finally, when file paths are presented in the UI, having the wrong separators hinders copy and pasting those paths into some common command line tools.
It's true that LLVM doesn't currently handle most of these cases where preserving the backslashes is important. But the more we try to shoehorn all paths into the Posix model, the harder it will be to improve portability later. I don't envy the VMS folks, where a proper path name can include a colon, square brackets, dot separators, and a semicolon.
I'd prefer it if we didn't have to guess the path style in the first place. Initially I thought the target triple might be able to help us out, but it doesn't, because really you want to know the path style of the file system in use where compilation took place. Normally, clang is the one reading the source files, so whatever clang's host's style is is the right style. These test cases are getting tripped up because compilation is split over two host OSs: Clang's OS and llc's OS. Now I wonder if we shouldn't just persist the path style through the IR module, or even through the assembler if DWARF needs to join paths after assembly.
Separately, I would like LLVM to move in the direction of standardizing on forward slashes in internal representations and data structures. We already convert long paths to UNC style before we call FS APIs on Windows. That's a good point to rewrite to backslashes if we need to. I don't believe any information or capabilities are lost: NTFS filenames may not contain / characters. Even if we receive a funky UNC-style path from a user, we can change the slash direction internally, and no information will be lost.
If we move that direction, we'll have to re-think VFS again. I'll also lobby hard to make sure that we show the native representation in compiler output, like diagnostic messages.
We already convert long paths to UNC style before we call FS APIs on Windows. That's a good point to rewrite to backslashes if we need to. I don't believe any information or capabilities are lost: NTFS filenames may not contain / characters. Even if we receive a funky UNC-style path from a user, we can change the slash direction internally, and no information will be lost.
Technical nit: The \\?\ prefix to bypass path parsing and access the NTFS namespace directly looks similar to a UNC path, but it's not a UNC path.
I believe NTFS paths can contain almost any Unicode character, including ones the Win32 conventions choose to treat as reserved. It simply becomes difficult to work with such outliers, but you can do almost anything with CreateFile and the \\?\ escape prefix. You can even name a file ...
The Win32 namespace is built on top of the NT namespace, using symlinks to map Win32 naming conventions to devices, and some NT namespace device nodes are simply parents for the underlying filesystem (like NTFS).
Is x a relative file name in Windows? It could be. But Windows APIs will treat it like a drive letter because it's only one character. If the current path for the X: drive is the root directory, then x is effectively an absolute path. That's why, to access a single-letter file name, you often have to type .\x or x. or even x::$DATA. It's a real mess. Similar surprises (and workarounds) await those who name their file like a DOS-style device name, like AUX, CON, PR, NUL or COM1.
My point is, there are a lot pitfalls here. Some we handle, many we don't. But the more we try to canonicalize paths into a non-native style, the more traps we might encounter and the harder it will be to fix them.
There's also a case where the opposite is wanted; when operating in mingw mode, there's a predecent of GCC outputting paths with forward slashes. This is needed for cases where e.g. libtool parses the output of e.g. $CC -v for the linker command/options, and the output of $CC --print-search-dirs. (libtool is largely implemented in shellscript, tolerating backslashes would require lots of changes to how things are quoted/escaped, and libtool is essentially unmaintained at this point...) See https://reviews.llvm.org/D53066 for an earlier attempt at fixing that, which I had to revert, because I attempted it at the wrong level and it turned out to be much more of a mess than I expected...
Yes, we might have to adjust VFS usage, but most of the important users are in-tree, so this is doable, not something unfixable, like external libraries that won't accept forward slashes. And yes, I think diagnostics seem like a reasonable place to canonicalize path names for the user to whatever their preference is, but as Martin points out, the user's preference may be different depending on the environment.
We already convert long paths to UNC style before we call FS APIs on Windows. That's a good point to rewrite to backslashes if we need to. I don't believe any information or capabilities are lost: NTFS filenames may not contain / characters. Even if we receive a funky UNC-style path from a user, we can change the slash direction internally, and no information will be lost.
Technical nit: The \\?\ prefix to bypass path parsing and access the NTFS namespace directly looks similar to a UNC path, but it's not a UNC path.
I believe NTFS paths can contain almost any Unicode character, including ones the Win32 conventions choose to treat as reserved. It simply becomes difficult to work with such outliers, but you can do almost anything with CreateFile and the \\?\ escape prefix. You can even name a file ...
This is similar to the way in which NTFS file names are not really UTF-16, they are UCS-2, and you can jam pretty much any 16-bit integer (maybe not 0?) into the file name if you like. It may not be possible to re-encode all filenames as UTF-8, but LLVM does it anyway, and nobody has complained about it yet, to my knowledge.
The Win32 namespace is built on top of the NT namespace, using symlinks to map Win32 naming conventions to devices, and some NT namespace device nodes are simply parents for the underlying filesystem (like NTFS).
Is x a relative file name in Windows? It could be. But Windows APIs will treat it like a drive letter because it's only one character. If the current path for the X: drive is the root directory, then x is effectively an absolute path. That's why, to access a single-letter file name, you often have to type .\x or x. or even x::$DATA. It's a real mess. Similar surprises (and workarounds) await those who name their file like a DOS-style device name, like AUX, CON, PR, NUL or COM1.
Maybe this is my non-Windows native-ness showing, but I think in all these cases, it would be preferable to treat all of these as vanilla, relative filenames, with the possible exception of NUL. I'm not sure I want clang ... -o COM1 to try to write to a DOS device. It limits LLVM tools to the common denominator of FS operations shared between Posix and Windows, but maybe we can accept those limitations.
My point is, there are a lot pitfalls here. Some we handle, many we don't. But the more we try to canonicalize paths into a non-native style, the more traps we might encounter and the harder it will be to fix them.
That's true, but I think we stand to gain a lot by making it easier to write LLVM code and tests that are portable by default. Consider our UTF-8 internal string representation: I consider this decision to have resulted in a huge win for simplicity. LLVM may have too many string types, but we'd have twice as many if we had Unicode variants of them all. :)
Yes, but many of our uses today are to allow for portability: they allow us to write tests using Posix-style paths that work on both Posix and Windows. I believe canonicalization would require some of those tests to have separate Posix and Windows versions.
like external libraries that won't accept forward slashes. And yes, I think diagnostics seem like a reasonable place to canonicalize path names for the user to whatever their preference is, but as Martin points out, the user's preference may be different depending on the environment.
We already convert long paths to UNC style before we call FS APIs on Windows. That's a good point to rewrite to backslashes if we need to. I don't believe any information or capabilities are lost: NTFS filenames may not contain / characters. Even if we receive a funky UNC-style path from a user, we can change the slash direction internally, and no information will be lost.
Technical nit: The \\?\ prefix to bypass path parsing and access the NTFS namespace directly looks similar to a UNC path, but it's not a UNC path.
I believe NTFS paths can contain almost any Unicode character, including ones the Win32 conventions choose to treat as reserved. It simply becomes difficult to work with such outliers, but you can do almost anything with CreateFile and the \\?\ escape prefix. You can even name a file ...
This is similar to the way in which NTFS file names are not really UTF-16, they are UCS-2, and you can jam pretty much any 16-bit integer (maybe not 0?) into the file name if you like. It may not be possible to re-encode all filenames as UTF-8, but LLVM does it anyway, and nobody has complained about it yet, to my knowledge.
I have no objection to using UTF-8 internally, but that doesn't seem like the same issue to me. Conventions like the . and .. directories are implemented at a different layer, a layer which sometimes you _must_ bypass.
The Win32 namespace is built on top of the NT namespace, using symlinks to map Win32 naming conventions to devices, and some NT namespace device nodes are simply parents for the underlying filesystem (like NTFS).
Is x a relative file name in Windows? It could be. But Windows APIs will treat it like a drive letter because it's only one character. If the current path for the X: drive is the root directory, then x is effectively an absolute path. That's why, to access a single-letter file name, you often have to type .\x or x. or even x::$DATA. It's a real mess. Similar surprises (and workarounds) await those who name their file like a DOS-style device name, like AUX, CON, PR, NUL or COM1.
Maybe this is my non-Windows native-ness showing, but I think in all these cases, it would be preferable to treat all of these as vanilla, relative filenames, with the possible exception of NUL. I'm not sure I want clang ... -o COM1 to try to write to a DOS device. It limits LLVM tools to the common denominator of FS operations shared between Posix and Windows, but maybe we can accept those limitations.
I think you're missing my point. If we want names like x and com1 to behave like plain old file names, then we cannot just treat them as such. We have to recognize that they're special and escape them to ensure they work that way. Otherwise, clang ... -o COM1 will indeed try to open the first serial port.
My point is, there are a lot pitfalls here. Some we handle, many we don't. But the more we try to canonicalize paths into a non-native style, the more traps we might encounter and the harder it will be to fix them.
That's true, but I think we stand to gain a lot by making it easier to write LLVM code and tests that are portable by default. Consider our UTF-8 internal string representation: I consider this decision to have resulted in a huge win for simplicity. LLVM may have too many string types, but we'd have twice as many if we had Unicode variants of them all. :)
As for portable tests, I'd like to point out that VFS, which allows the hybrid Windows- and Posix-style paths, is currently a technique for writing portable tests. If we canonicalize, some of those tests will have to be split into separate Windows and Posix versions.
I agree with the UTF-8 choice, but I don't think that's an analogous example. Ignoring less common styles of Windows path syntax and failing to "escape" the special names is a different beast. Not only will it not do what anyone wants or expects, canonicalization can cut users off from reasonable workarounds. For example, if I know that x will be treated like a drive letter but I want it as a file name, how can I express that in a way the canonicalization doesn't strip away?