Page MenuHomePhabricator

[lldb] Unicode support on Win32

Authored by cameron314 on Feb 10 2016, 3:24 PM.



Various fixes for Win32 interop, particularly with regard to paths; paths containing non-ASCII characters now work properly (they are converted to and from UTF-8 internally).

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I've addressed more feedback. I'm working on a second version of the patch with a lot less #ifdefs and nicer wrappers (and better error handling where possible).

171 ↗(On Diff #47539)

I understand. I'll try to refactor out the ifdefs.

140 ↗(On Diff #47539)

Hmm. Alright then. I'll put the wrappers in LLDB for now.

881 ↗(On Diff #47539)

I don't trust myself to make this change without introducing a bug. But I'll get rid of the #ifdef.

353 ↗(On Diff #47539)

...which turns out to exist here too. I call that instead of using stat now, much cleaner!

108 ↗(On Diff #47539)

Makes sense. Done.

237 ↗(On Diff #47539)

Actually, it seems the stat is just done to see if the path exists. llvm::sys::fs::exists does this already, I'll call that instead. It does the proper string conversion and uses the right Win32 API call on Windows.

829 ↗(On Diff #47539)

By the way, all the FileSystem methods seem to accept FileSpecs as input, not raw paths. I agree it would be nice if FileSystem abstracted the OS-level FS, which all the other higher-level parts (FileSpec, File) then depended on. But that would require changing the API rather significantly.

34 ↗(On Diff #47539)

They would, but these are called from functions that are trying to emulate libc functions, by avoiding calling malloc among other things. Hence these wrappers that work with buffers.

210 ↗(On Diff #47539)

Not sure I understand. That's what this code does? There's a tad of extra logic needed because the path passed in is allowed to be NULL, indicating we should allocate a buffer for the caller.

1175 ↗(On Diff #47539)

There is, but the mode is an enum and not a string. I'll get rid of the #ifdef, though.

457 ↗(On Diff #47539)

Quite so. I'll add a portable getenv wrapper in HostInfo.

zturner added inline comments.Feb 22 2016, 3:24 PM
881 ↗(On Diff #47539)

One possibility is to revert this portion of the patch (along with any other changes that you don't feel comfortable making) back to their original state. This will still leave certain parts of unicode functionality broken, but it would reduce the surface area of "things that might be wrong with the patch" and give people a chance to make sure nothing else is seriously broken. Then, we can do these other "more difficult" locations as a followup patch. The nice thing about that is that if it does end up breaking something, and we have to revert it, it doesn't revert 100% of the work towards getting Unicode working, but only a couple of points.

237 ↗(On Diff #47539)

Even better :)

829 ↗(On Diff #47539)

Yea, it's a bit unfortunate. In reality there should be overloads that take FileSpecs and std::strings or llvm::StringRefs, and then only 1 implementation that the others re-use. But right now there aren't.

210 ↗(On Diff #47539)

Ahh that extra logic is what I was referring to. I wasn't aware of those semantics of getcwd. Can you add a comment explaining that, since it's not obvious from looking at the code.

1175 ↗(On Diff #47539)

We have a function somewhere that converts a mode string to the enum. I think it's in File.h or File.cpp. Let me know if you can't find it.

cameron314 added inline comments.Feb 22 2016, 3:39 PM
1289 ↗(On Diff #47539)

Hmm, turns out it does. I'll add code to revert it at the end.

amccarth added inline comments.Feb 22 2016, 3:47 PM
1179 ↗(On Diff #47539)

Ah, I see. I work in another code base where PATH_MAX is synonymous with MAX_PATH, so I was confused.

Buffers of 32K characters on the stack seem like a bad idea. We need a vector or string or some other container that puts the bulk of the data somewhere other than the stack.

cameron314 added inline comments.Feb 22 2016, 4:01 PM
881 ↗(On Diff #47539)

This makes sense. I'm almost done with the next version of the patch, we'll see what should make the cut at that point.

1179 ↗(On Diff #47539)

Right. There's currently ~110 places in LLDB that allocate buffers of size PATH_MAX on the stack. Nearly all of those predate my patch, it seems.

210 ↗(On Diff #47539)

Haha, sure. I wasn't aware either, but it blew up at runtime since it turns out LLDB actually makes use of this when booting with a relative path.

1175 ↗(On Diff #47539)

You know, it turns out there's one right in this very file. Thanks!

amccarth added inline comments.Feb 22 2016, 4:04 PM
1179 ↗(On Diff #47539)

Right, but this is a new case because you changed it from MAX_PATH (260) to PATH_MAX (32K). I'd rather not introduce more instances like this.

cameron314 added inline comments.Feb 22 2016, 5:42 PM
1179 ↗(On Diff #47539)

Good point, I'll fix it up :-)

cameron314 updated this revision to Diff 49066.Feb 25 2016, 8:25 AM
cameron314 edited edge metadata.
cameron314 removed rL LLVM as the repository for this revision.

Here's a new version of the patch which takes into account most of the feedback so far (less #ifdefs, etc.). It depends on my pending patch in LLVM (D17549) that introduces a few more helper wrappers for UTF conversion.

cameron314 added inline comments.Feb 25 2016, 8:28 AM
618 ↗(On Diff #49066)

Unrelated to this patch, but can this method be called from several threads at once?

I'm growing more comfortable with these changes, but I'll defer to Zach.

242 ↗(On Diff #49066)

I recognize that you're just repeating the pattern from above, but ...

This seems whacky and dangerous. It appears the attempt is to put a null terminator on the end, but not count it in the length of the vector. And I guess that we know it's safe here because path is an llvm::SmallVector, so the implementation details are known. But, ugh. If path were ever changed to std::vector, we'd no longer have assurance that the terminator remains after the pop.

231 ↗(On Diff #49066)

I agree with Zach that a dynamic solution here is better. It's already icky that we have a 32KB stack buffer here. Turning it into a 64KB +1B stack buffer seem egregious.

cameron314 added inline comments.Feb 26 2016, 7:32 AM
242 ↗(On Diff #49066)

I totally agree. Took me a few minutes before I figured out what it was doing the first time I saw it :-) This is also done in the existing convertUTF8ToUTF16String wrapper too.

All the implementations of std::vector that I know of will work with this, though as you say it's not guaranteed by the standard.

Looking more closely, I think in this case it was only done for the call to stat, which I've removed. I had added it here too in case the caller relied on this null trick, but I don't think it's necessary in either place anymore. I'll remove it.

231 ↗(On Diff #49066)

Hah, yes. I'll replace this with a vector to start with.

cameron314 updated this revision to Diff 49185.Feb 26 2016, 7:36 AM
cameron314 edited edge metadata.

I don't want to be pushy, but is there a chance of this patch being accepted soon?
It does depend on D17549 in LLVM which is also still pending, but that one's at least very a simple patch adding the support functions that this one uses.

zturner added a subscriber: zturner.Mar 7 2016, 8:07 AM

Sorry slipped under my radar, I'll look again today

No worries, thanks for taking a look!

Can you rebase against tip of trunk? I want to apply and run the test suite with the patch applied, but I'm getting a lot of errors. After uploading a new rebased patch, can you also respond with the revision you rebased against so I can make sure I'm at the same place.

330 ↗(On Diff #49185)

Any particular reason you're using _SH_DENYWR instead of _SH_DENYNO? No matter what we do we will always have subtle differences in semantics, but _SH_DENYNO is closer to posix semantics.

Absolutely, I'm rebasing now. This is unfortunately the type of patch that rots really quickly ;-)

I did run check-lldb locally at one point, with no major differences before and after the patch. But I recently discovered the version of Python I'd compiled had accidentally been compiled with VS2008, and since the tip's changed anyway, it's absolutely worth running them again. This was the day before the instructions to use VS21015 with Python 3 were posted, haha.

330 ↗(On Diff #49185)

No particular reason other than I find _SH_DENYWR a reasonable default. It generally doesn't hurt to let others read the file while it's being written, and that's sometimes crucial (e.g. for log files which are written to incrementally and only closed at the very end). I can change it if you prefer.

zturner added inline comments.Mar 7 2016, 11:45 AM
330 ↗(On Diff #49185)

I like deny write in principle as well, but I kind of do prefer to change it if for no other reason than it matches the semantics on Posix more closely. We've had sharing access denied errors in the past on Windows that we traced back to sharing violations, and it's easier if all platforms just use the same semantics.

cameron314 added inline comments.Mar 7 2016, 12:01 PM
330 ↗(On Diff #49185)

OK, I'll change it for the next patch.

cameron314 updated this revision to Diff 49986.Mar 7 2016, 4:16 PM

Sorry for the delay, I wanted to make sure LLDB compiled without warnings after I rebased the patch. It applies cleanly on the tip as of this writing, and is based on rL262819.

I'm waiting for the LLVM side change to go in and this patch rebased on top of that before I test it out.

In the meantime, one thing I'm concerned about is setting the codepage of the console. You determined earlier that it affects the state of the console even after exiting LLDB. So we clean it up on shutdown. But what if LLDB crashes? I'm not sure if this is a concern, but it's troublesome at the very least.

If I remember correctly, this patch was originally intended to support your use case of linking against liblldb but not using the lldb.exe driver, is that right? Or maybe you were using the python extension module, I don't remember. In any case, neither of those use cases requires you to display UTF8 text to the console window. So I wonder if it is possible to do without the codepage portion of the patch.

I would like to avoid that change if at all possible, but if it is really necessary, I would at least like to do it as a separate patch, so that if need be we can revert it later without reverting the rest of the changes here.

Yes, if lldb.exe crashes or otherwise exits without returning to main, then the codepage will stay set in the current console. The latter can be fixed with a static object destructor, but not the former. I don't think this would be a practical problem, since very few console applications depend on the codepage in the first place (those that do tend to set it manually when they start). The setting isn't permanent, either, it's only for the current console window until it closes (or is reset by somebody else). And it doesn't affect the bytes input/output by any programs, only the way the console interprets those bytes. I'd say the impact of changing the codepage without changing it back is very low.

That part of the patch is not quite as important -- without it UTF-8 I/O is broken, but it's already broken anyway unless the user has taken explicit steps to reconfigure Window's console system to support Unicode character display (which requires fiddling in the registry and rebooting). No one I know has done this. I can remove it from the patch if you think it's too risky.

We're using both liblldb and lldb.exe, though mostly liblldb. No Python bindings as of now.

The patch for LLVM has been accepted, I'm waiting for someone to commit it when they have a moment. I'll rebase this one on the tip again afterwards.

Sounds good. I will test this out once that patch goes in. And yea, I
would prefer to remove the codepage stuff for now, and we can re-consider
adding it back if/when someone actually needs it. Then we can discuss some
other options like a preprocessor define that enables setting the codepage,
or a command line switch, etc.

Here we go! D17549 has been committed, so I've rebased this patch onto that latest version (r263233).

Oops, was missing a slash in the dst-prefix of the previous patch update.

... and had a slash too many at the start of the path prefixes. Sorry for the spam.

@zturner: Let me know when you're ready for this patch and I'll rebase it again, since there's been quite a few more commits.

I can look at it today. Sorry again for the delay, rebase 1 more time and
I'll check it out today

Oh yea I think the reason I didn't look at it again is because I was
waiting for the update where you removed the codepage stuff from the driver.

No worries. I removed the codepage stuff when I did the last rebase. Another one coming up shortly!

cameron314 updated this revision to Diff 50976.Mar 17 2016, 2:30 PM

Here we go! Rebased to tip (rL263735).

I'm not sure what your source tree layout looks like, but this isn't applying for me. All your paths have "trunk" in front of them, which is a little unusual in the directory structure is supposed to be something like this:


And I would apply the patch inside of the lldb directory. But yours appears to look like this:


So it doesn't apply.

Ah. Sorry. I have the same source layout as you locally, but I changed the paths in the patch to match the layout of Diffusion (rL). I'll redo the patch with relative paths from lldb...

cameron314 updated this revision to Diff 50979.Mar 17 2016, 2:45 PM

I think we're in different time zones -- I'm heading home for the day, but I'll look back in the morning to see if there's anything else that needs addressing.

I'm getting this when linking:

[826/826] Linking CXX executable bin\lldb.exe
FAILED: cmd.exe /C "cd . && "C:\Program Files (x86)\CMake\bin\cmake.exe" -E vs_link_exe --intdir=tools\lldb\tools\driver\CMakeFiles\lldb.dir --manifests  -- C:\PROGRA~2\MICROS~3.0\VC\bin\AMD64_~2\link.exe /nologo @CMakeFiles/lldb.rsp  /out:bin\lldb.exe /implib:lib\lldb.lib /pdb:bin\lldb.pdb /version:3.9  /machine:X86 /STACK:10000000 /debug /INCREMENTAL /subsystem:console  && cd ."
ucrtd.lib(ucrtbased.dll) : error LNK2005: _signal already defined in Platform.cpp.obj
   Creating library lib\lldb.lib and object lib\lldb.exp
bin\lldb.exe : fatal error LNK1169: one or more multiply defined symbols found
LINK Pass 1 failed. with 1169
[826/826] Linking CXX executable bin\lldb-mi.exe
FAILED: cmd.exe /C "cd . && "C:\Program Files (x86)\CMake\bin\cmake.exe" -E vs_link_exe --intdir=tools\lldb\tools\lldb-mi\CMakeFiles\lldb-mi.dir --manifests  -- C:\PROGRA~2\MICROS~3.0\VC\bin\AMD64_~2\link.exe /nologo @CMakeFiles/lldb-mi.rsp  /out:bin\lldb-mi.exe /implib:lib\lldb-mi.lib /pdb:bin\lldb-mi.pdb /version:3.9  /machine:X86 /STACK:10000000 /debug /INCREMENTAL /subsystem:console  && cd ."
ucrtd.lib(ucrtbased.dll) : error LNK2005: _signal already defined in Platform.cpp.obj
   Creating library lib\lldb-mi.lib and object lib\lldb-mi.exp
bin\lldb-mi.exe : fatal error LNK1169: one or more multiply defined symbols found
LINK Pass 1 failed. with 1169
ninja: build stopped: subcommand failed.

I assume it must be related to the fact that we're now defining UNICODE and _UNICODE, but I need to run and don't have enough time to figure out why.

I've always disliked the fact that we redefine the signal function which is already builtin in Windows anyway. I actually feel like we need to introduce the function sighandler_t Host::Signal(int sig, sighandler_t sigFunc), and on non-windows have this function call signal, and on Windows use our own custom implementation.

Let me know if you have any better ideas.

Hmm, that doesn't seem good. Locally it links for me (I get compile time warnings about the signal stuff, though).
It looks like the WinSDK signal.h is being pulled in despite _INC_SIGNAL being defined (note that there's no include guard in that header -- it uses #pragma once instead). And the signal handler callback has a different return type. I don't see anything related to Unicode.

Here's the tail of my build output for lldb-mi.exe. I'm using VS2015 on Windows 7 x64, compiling in release with ninja:

[77/78] 13.653s Building CXX object tools\lldb\tools\lldb-mi\CMakeFiles\lldb-mi.dir\Platform.cpp.obj
d:\dev\llvm\tools\lldb\tools\lldb-mi\Platform.h(84): warning C4005: 'SIG_DFL': macro redefinition
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\signal.h(35): note: see previous definition of 'SIG_DFL'
d:\dev\llvm\tools\lldb\tools\lldb-mi\Platform.h(85): warning C4005: 'SIG_IGN': macro redefinition
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\signal.h(36): note: see previous definition of 'SIG_IGN'
d:\dev\llvm\tools\lldb\tools\lldb-mi\Platform.h(87): warning C4273: 'signal': inconsistent dll linkage
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\signal.h(57): note: see previous definition of 'signal'

[78/78] 14.332s Linking CXX executable bin\lldb-mi.exe
   Creating library lib\lldb-mi.lib and object lib\lldb-mi.exp

The reason I mentioned UNICODE is because it's one of the main differences between pre-patch and post-patch. It works for me pre-patch. What command line are you passing to CMake? Are you using MSVC 2015 Update 1 or initial release?

The other difference is that I'm compiling in Debug and you said you were compiling release. Also I'm doing an x86 build of LLDB. Can you try x86 debug build with MSVC 2015 Update 1 and see if it still succeeds for you?

Since it works without my patch, you're probably right about it being related to the UNICODE define. All the other changes are completely removed from that part of the code. But I still don't see how it could affect that.

Here's the script I use to run cmake:

set PATH=%ProgramFiles%\Ninja;%ProgramFiles(x86)%\CMake\bin;%SYSTEMROOT%\system32;%SYSTEMROOT%;%SWIGDIR% 
call "%VS140COMNTOOLS%\vsvars32.bat" 
mkdir VS2015Debug 
cd VS2015Debug 
cd .. 
mkdir VS2015Release 
cd VS2015Release 
cd .. 

This is with VS2015 update 1. And it's already the x86 version, the patch has diverged significantly from what we have locally (which is based on a snapshot from several weeks ago) so I've been using the tip as-is :-)

I'm trying a fresh Debug build from scratch (deleting and re-creating the cmake build folder). I'll report back when it's done.

I'm using the amd64_x86 toolchain. They're supposed to be identical so
that's unlikely to be the problem, but it's the only difference i can see.
Let me know what happens on your clean rebuild

Aha, I get the same error now:

fatal error LNK1169: one or more multiply defined symbols found

I'm looking into it!

Patch doesn't appear to be clang-formatted. All patches need to be run through clang-format before submitting.

I think this is my last set of comments. If there's any you disagree with let me know. If you agree with everything, I can probably just make the changes on my end since they are mostly mechanical, and then submit.


This is using a tab, but it should be 2 spaces.


nullptr instead of NULL


Here we are doing some stuff with _USE_32BIT_TIME_T, but we reproduce almost this exact same logic in File::GetPermissions() except without the special case for _USE_32BIT_TIME_T. I think we should introduce a function called FileSystem::Stat() in very much the same way that you've done FileSystem::Fopen(). And then we put the implementation there so that we can have it in one place and make sure it's always going to be consistent and correct.


Should we be using FileSystem::Fopen() here?


Same here. FileSystem::Fopen()

And btw, the multiply defined symbol error is gone now that the custom signal function is removed

I'll run clang-format and submit a new patch. Thanks for fixing that duplicate symbol with the signal stuff!


Yeah, I agree this should really go in FileSystem.


Since this is MSVC specific code already, I think it's OK. The _SH_DENYWR seems important here.


Absolutely. I missed this one.

cameron314 updated this revision to Diff 51231.Mar 21 2016, 2:30 PM

Here's the latest patch, now with 100% more autoformatting and a centralized FileSystem::Stat function.
I didn't rebase, but it should still apply cleanly.

zturner accepted this revision.Mar 22 2016, 10:59 AM
zturner edited edge metadata.

I think this is ready to go in. You probably noticed that this took a long time to work out all the issues with. In the future if there's any way you could break things up into smaller patches, you will probably end up getting lots of small patches in in a shorter total time than one large patch. Even if it just means fixing all the Unicode issues in file X, and then in file Y in a separate patch, etc.

Anyway, thanks for seeing this through to the end!

This revision is now accepted and ready to land.Mar 22 2016, 10:59 AM

Excellent, thank you!
Fortunately, almost all the other changes I've made to llvm/clang/lldb locally are fairly small patches.

cameron314 closed this revision.Mar 22 2016, 3:04 PM

Closing now that the patch has been committed by rL264074.
(Not sure why it didn't auto-link.)

cameron314 added inline comments.Mar 22 2016, 4:26 PM

Seems this breaks the LLDB build on OS X. Sean has reverted this part, but it should be using the #ifdef that was there in the earlier version of the patch:

#ifdef _WIN32
    std::wstring path;
    if (!llvm::ConvertUTF8toWide(vFileNamePath.c_str(), path))
        return false;
    pTmp = ::_wfopen(path.c_str(), L"wb");
    pTmp = ::fopen(vFileNamePath.c_str(), "wb");